Hi,

I am trying to strip all links, including the text between the anchor tags, from an HTML string, as below:

 string LINK_TAG_PATTERN = "/<a\b[^>]*>(.*?)<\\/a>";

 htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, string.Empty);

This is not working. Does anyone have an idea why?

Thanks a lot,

Edit: the regex was from this link http://stackoverflow.com/questions/1991337/extract-text-and-links-from-html-using-regular-expressions

+2  A: 

I recommend Expresso to troubleshoot regular expressions. You can find a library of regular expressions here.

You might consider using JavaScript to walk the DOM tree for your replacements instead of a regex.

Dave Swersky
+3  A: 

Use an HTML parser, not regular expressions, to parse HTML.

HTML Agility Pack
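
As a rough illustration of the parser-based approach, here is a minimal sketch that removes every anchor element, together with its inner text, using the HTML Agility Pack. The class and method names (`LinkRemover`, `RemoveLinks`) are made up for this example, and it assumes the HtmlAgilityPack NuGet package is referenced by the project.

```csharp
using System.Linq;
using HtmlAgilityPack; // third-party NuGet package, assumed to be installed

// Hypothetical helper, named for this example only.
public static class LinkRemover
{
    public static string RemoveLinks(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a");
        if (anchors == null)
        {
            return html;
        }

        // Take a snapshot of the matches defensively before mutating the document.
        foreach (var anchor in anchors.ToList())
        {
            anchor.Remove(); // drops the <a> element along with everything inside it
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling `LinkRemover.RemoveLinks(htmltext)` would then take the place of the `Regex.Replace` call in the question. Note that the parser may normalise the remaining markup slightly, so the output is not guaranteed to be byte-for-byte identical to the input outside the removed links.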

RC
A: 

Conceptually, this only strips links of a very specific kind. For example, your regex does not match an upper-case A, which is perfectly valid in HTML: <A ...>bla</A>. The replacement wouldn't work for JavaScript links either. Is your code relevant to user security?

Thorsten79
+2  A: 

Problems in your string: an unnecessary slash at the beginning (that's Perl syntax), an unescaped backslash (\b, which a regular C# string literal turns into a backspace character), and an unnecessarily escaped backslash (\\).

So, if it has to be a regex, and bearing in mind all the warnings that other people have already linked to, try:

string LINK_TAG_PATTERN = @"<a\b[^>]*>(.*?)</a>";
htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, string.Empty, RegexOptions.IgnoreCase);

The \b is necessary to prevent other tags that start with "a", such as <abbr> or <address>, from matching.
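
To make the effect of the verbatim string and the word boundary easier to check, here is a small sketch that wraps the pattern above in a helper. The names `LinkStripper` and `StripLinks` are invented for illustration. `RegexOptions.Singleline` is an addition not present in the snippet above; it is only there on the assumption that link text may span several lines.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, named for this example only.
public static class LinkStripper
{
    // The verbatim string (@"...") lets \b reach the regex engine as a word
    // boundary instead of being compiled into a backspace character.
    // Singleline (an assumption, see above) lets "." match newlines too.
    private static readonly Regex LinkTag = new Regex(
        @"<a\b[^>]*>(.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Removes every <a ...>...</a> element, including the text inside it.
    public static string StripLinks(string html)
    {
        return LinkTag.Replace(html, string.Empty);
    }
}
```

For example, `LinkStripper.StripLinks("See <A HREF=\"x\">this</A> and <abbr>e.g.</abbr>")` should remove the upper-case link but leave the `<abbr>` element alone, because `\b` cannot match between the "a" and the "b" of "abbr". Like any regex approach, it can still be fooled by unusual markup, for instance a `>` inside an attribute value.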

Tim Pietzcker
+1  A: 
string LINK_TAG_PATTERN = @"(<a\s+[^>]*>)(.*?)(</a>)";

htmltext = Regex.Replace(htmltext, LINK_TAG_PATTERN, "$1$3", RegexOptions.IgnoreCase);
Igor Korkhov