Hi,

I need to do a search and replace on long text strings. I want to find all instances of broken links that look like this:

<a href="http://any.url.here/%7BlocalLink:1369%7D%7C%7CThank%20you%20for%20registering">broken link</a>

and fix it so that it looks like this:

<a href="/{localLink:1369}" title="Thank you for registering">link</a>

There may be any number of these broken links in the text field. My difficulty is working out how to reuse the matched ID (in this case 1369) in the replacement. The ID changes from link to link, as do the URL and the link text.

Thanks,

David

EDIT: To clarify, I am writing C# code to run through hundreds of long text fields to fix broken links in them. Each single text field contains html that can have any number of broken links in there - the regex needs to find them all and replace them with the correct version of the link.

+2  A: 

I'm assuming that you already have the element and the attributes parsed. So to process the URL, use something like this:

 string url = "http://any.url.here/%7BlocalLink:1369%7D%7C%7CThank%20you%20for%20registering";
 Match match = Regex.Match(HttpUtility.UrlDecode(url), @"^http://[^/]+/\{(?<local>[^:]+):(?<id>\d+)\}\|\|(?<title>.*)$");
 if (match.Success) {
  Console.WriteLine(match.Groups["local"].Value);
  Console.WriteLine(match.Groups["id"].Value);
  Console.WriteLine(match.Groups["title"].Value);
 } else {
  Console.WriteLine("Not one of those URLs");
 }
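
The captured groups can then be put back together into the corrected href and title. A minimal sketch, assuming the same pattern as above; the `BrokenUrl` class and `TrySplit` method are hypothetical names, and `Uri.UnescapeDataString` stands in for `HttpUtility.UrlDecode` only so that the example compiles without a reference to System.Web:

```csharp
using System;
using System.Text.RegularExpressions;

public static class BrokenUrl
{
    // Same pattern as above: host, then {local:id}||title after URL decoding.
    private static readonly Regex Pattern = new Regex(
        @"^http://[^/]+/\{(?<local>[^:]+):(?<id>\d+)\}\|\|(?<title>.*)$");

    // Hypothetical helper: splits a broken URL into the fixed href and the title text.
    public static bool TrySplit(string url, out string href, out string title)
    {
        Match match = Pattern.Match(Uri.UnescapeDataString(url));

        // Rebuild the href from the captured pieces; the host part is dropped.
        href = match.Success
            ? "/{" + match.Groups["local"].Value + ":" + match.Groups["id"].Value + "}"
            : null;
        title = match.Success ? match.Groups["title"].Value : null;
        return match.Success;
    }
}
```

For the URL in the question this should give an href of /{localLink:1369} and a title of "Thank you for registering".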
Lucero
A: 

To include the match in the replacement string, you use $&.

Several other substitution markers can be used in the replacement string; the .NET documentation on regular expression substitutions lists them all.
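
As a rough illustration of how these markers behave in `Regex.Replace`, here is a small sketch. The `SubstitutionDemo` class, its method names, and the sample input are made up for the example; only `$&`, `$1`, and `${name}` are shown.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SubstitutionDemo
{
    // $& inserts the whole match.
    public static string WholeMatch(string s)
    {
        return Regex.Replace(s, @"\d+", "[$&]");
    }

    // $1 inserts the first numbered capture group.
    public static string NumberedGroup(string s)
    {
        return Regex.Replace(s, @"id (\d+)", "{localLink:$1}");
    }

    // ${id} inserts the capture group named "id".
    public static string NamedGroup(string s)
    {
        return Regex.Replace(s, @"id (?<id>\d+)", "/{localLink:${id}}");
    }
}
```

Because the replacement is applied to every match, `SubstitutionDemo.NumberedGroup("id 1369 and id 42")` rewrites both IDs in one call, each with its own captured number.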

Richard
+2  A: 

Take this with a grain of salt, since HTML and regex don't play well together:

(<a\s+[^>]*href=")[^"%]*%7B(localLink:\d+)%7D%7C%7C([^"]*)("[^>]*>[^<]*</a>)

When applied to your input and replaced with

$1/{$2}" title="$3$4

the following is produced:

<a href="/{localLink:1369}" title="Thank%20you%20for%20registering">broken link</a>

This is as close as it gets with regex alone. You'll need to use a MatchEvaluator delegate to remove the URL encoding from the replacement.
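
A sketch of what that evaluator could look like, assuming the pattern above rewritten with named groups. The `LinkFixer` class and the group names are hypothetical; `Uri.UnescapeDataString` is used for decoding (an assumption, `HttpUtility.UrlDecode` would also work if System.Web is referenced), and the title is HTML-encoded before it goes back into the attribute.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

public static class LinkFixer
{
    // Same shape as the pattern above, with named groups for readability.
    private static readonly Regex BrokenLink = new Regex(
        @"(?<open><a\s+[^>]*href="")[^""%]*%7B(?<link>localLink:\d+)%7D%7C%7C(?<title>[^""]*)(?<rest>""[^>]*>[^<]*</a>)",
        RegexOptions.IgnoreCase);

    public static string Fix(string html)
    {
        // The evaluator runs once per match and receives the Match object,
        // so the captured groups can be read directly; no second regex is needed.
        return BrokenLink.Replace(html, m =>
        {
            string title = Uri.UnescapeDataString(m.Groups["title"].Value);
            return m.Groups["open"].Value
                 + "/{" + m.Groups["link"].Value + "}\" title=\""
                 + WebUtility.HtmlEncode(title)
                 + m.Groups["rest"].Value;
        });
    }
}
```

Links whose href does not contain %7B are not matched, so already-correct links pass through unchanged.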

Tomalak
This is very close - thank you for helping. A couple of points:
1. The regex also matches correct links, which I don't want.
2. It replaces the broken links, but not quite right. It gives: <a href="http://url.still.here/%7BlocalLink:1369%7D" title="}||Thank you for registering">link</a> - I need to remove the url.still.here bit, and also the }|| in the title attribute.
3. The original source is URL-encoded, but I need the replaced text to use {localLink:1369} instead of %7BlocalLink:1369%7D.
Can you help? Thanks, David
David Conlisk
I've made a few changes to my regex, it should do it now.
Tomalak
A: 

Thanks to everyone for their help. Here is what I used in the end:

const string pattern = @"(<a\s+[^>""]*href="")[^""]+(localLink:\d+)(?:%7[DC])*([^""]+)(""[^>]*>[^<]*</a>)";
// Create a match evaluator to replace the matched links with the correct markup
var myEvaluator = new MatchEvaluator(FixLink);

var strNewText = Regex.Replace(strText, pattern, myEvaluator, RegexOptions.IgnoreCase);

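internal static string FixLink(Match m)
    {
        var strUrl = m.ToString();
        const string namedPattern = @"(<a\s+[^>""]*href="")[^""]+(localLink:\d+)(?:%7[DC])*([^""]+)(""[^>]*>[^<]*</a>)";
        // Use the same options as the outer Regex.Replace call so the same text matches again
        var regex = new Regex(namedPattern, RegexOptions.IgnoreCase);

        // $1 = opening tag up to href=", $2 = localLink:ID, $3 = title text, $4 = closing quote through </a>
        const string strReplace = @"$1/{$2}"" title=""$3$4";

        // Build the replacement once and reuse it for the debug output and the return value
        var strFixed = regex.Replace(strUrl, strReplace);
        HttpContext.Current.Response.Write(String.Format("Replacing '{0}' with '{1}'", strUrl, strFixed));
        return strFixed;
    }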
David Conlisk
I think you did not understand the use of the MatchEvaluator.
Tomalak