I'm creating a CLR user-defined function in SQL Server 2005 to clean up the contents of a large number of database tables.
The task is to remove almost all tags except links ('a' tags and their 'href' attributes). So I divided the problem into two stages: 1. creating a user-defined SQL Server function, and 2. creating a SQL Server script that updates all the involved tables by calling the CLR function.
For the user defined function and given the restricted environment, I prefer to do this with native libraries. That means, not using the Html Agility Pack, for example.
In JavaScript, this regular expression apparently does the right job:
<\s*a[^>]\s*href=(.*)>(.*?)<\s*/\s*a>
At least, according to http://www.pagecolumn.com/tool/regtest.htm
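To make the capturing groups concrete, here is a minimal JavaScript sketch of the behaviour I'd like to reproduce in C#. The helper name `extractLink` is hypothetical and only used for illustration, and the sketch assumes a single link on one line with a quoted href value. Note that the first group comes back with the attribute's quotes still attached, so they are stripped separately.

```javascript
// The pattern from above, unchanged.
// Group 1 is the href value (quotes included), group 2 is the link text.
const linkPattern = /<\s*a[^>]\s*href=(.*)>(.*?)<\s*\/\s*a>/;

// Hypothetical helper, named only for this illustration.
function extractLink(html) {
  const match = linkPattern.exec(html);
  if (match === null) {
    return null; // no link found in the input
  }
  // Strip one leading and one trailing quote from the captured href.
  const href = match[1].replace(/^["']|["']$/g, "");
  return { href: href, text: match[2] };
}

const result = extractLink('<a href="http://example.com">some text</a>');
console.log(result.href); // http://example.com
console.log(result.text); // some text
```

Because `(.*)` is greedy, this only behaves as shown when the input holds one link; with several links on the same line the first group would run through to the last one.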
But I don't know how to translate that (especially the capturing groups) into C# code so that the captured text becomes part of the output.
For instance, if the input is <a href="http://example.com">some text</a>, how can I keep "http://example.com" and "some text" as part of the output in C# code while stripping every other HTML tag (and its content)?