I have a string like
Pakistan, officially the <a href="Page.aspx?Link=Islamic Republic of Pakistan">Islamic Republic of Pakistan</a>

Now I am using
System.Text.RegularExpressions.Regex.Replace(inputText, "(\\bPakistan\\b)", "something");
to replace Pakistan, but I don't want to replace occurrences of Pakistan inside the <a></a> tags.
Edit: here is an actual string:

Pakistan (Urdu: پاکِستان), officially the Islamic Republic of Pakistan, is a country in South Asia. It has a 1,046-kilometre (650 mi) coastline along the Arabian Sea and Gulf of Oman in the south and is bordered by Afghanistan and Iran in the west, India in the east and China in the far northeast.[6] Tajikistan also lies very close to Pakistan but is separated by the narrow Wakhan Corridor.


And an array of strings:

string[] links = {"Pakistan","Islamic Republic","Republic of Pakistan","South Asia","Arabian Sea","Gulf","Oman","Gulf of Oman","the south","in the south","Afghanistan","Iran","the west","in the west","west India","the east","China","Tajikistan","the narrow","Wakhan Corridor","Central Asia","the Middle","Middle East","the Middle East"};

I want to replace every occurrence of every string in this array with <a href="page.aspx?link=thisString">thisString</a>. I could not correctly add links to strings like "Republic of Pakistan", where "Pakistan" is itself another string in the array.

+2  A: 

If you're trying to do something in the context of HTML syntax, use an HTML parser.

Amber
A: 

Get each line of text into a string A

Remove the bit between <a></a> and store it in string B

Run your Regex on the remaining text in string A

return A + B

Peter McGrattan
Location of <a></a> tags will be lost.
Taz
No it won't, you need to show a simple code sample with some clear sample data.
Peter McGrattan
You are right, in this string it won't. But `<a></a>` does not necessarily appear at the end, and there can be more than one `<a></a>` block.
Taz
Yes: so show us some code and some useful test data so we can have a chance of helping you better in all scenarios!
Peter McGrattan
I have edited the question. I think it is clearer now.
Taz
+1  A: 

Here's how you can do the opposite of what you're asking (replace only the instances inside the tags):

content = Regex.Replace(content, @"(?<=\<\s*a[^>]+)\bPakistan\b(?=.*?\>)", "India");

This is very untested and not what you want, but it could give you some hints. This uses zero-width lookaround assertions. I'm sure there are many other ways to do it.

This is really pushing the limits of regex. You should probably use an HTML parser.

Edit: using negative lookbehind, this appears to work (please test it!):

content = Regex.Replace(content, @"(?<!\<\s*a[^>]+)\bPakistan\b", "India");
Chris Schmich
Does the C# regex allow variable-width expressions in negative lookbehinds? Most regex engines that support lookbehinds don't allow variable-width expressions (due to not knowing how far back to step to attempt to match them).
Amber
My potentially flawed understanding of "zero-width" was that it meant the assertion captured nothing. The .NET regex example at http://msdn.microsoft.com/en-us/library/bs2twtah.aspx#sectionToggle8 appears to use variable-width expressions: "(?<!(Saturday|Sunday) )\b\w+ \d{1,2}, \d{4}\b" (the Saturday/Sunday alternation).
Chris Schmich
@Dav: .NET is nearly unique among regex flavors in that you can use any expression you like inside a lookbehind. @Chris: it's more correct to say that a zero-width assertion (like a lookbehind) *consumes* nothing. Capturing is something else.
Alan Moore
I used it like this: inputText = Regex.Replace(inputText, @"(?<=\<\s*a[^<]+)\bStringToReplace\b(?=.*?\>)", "DBPT"); inputText = System.Text.RegularExpressions.Regex.Replace(inputText, "(\\bStringtoReplace\\b)", Replacement); inputText = Regex.Replace(inputText, @"(?<=\<\s*a[^<]+)\bDBPT\b(?=.*?\>)", StringtoReplace);
Taz
+1  A: 

Although @Chris's solution does not work exactly as-is here, you can use it this way:

string content = "Pakistan is <a href=\" Pakistan is\">Pakistan an islamic country</a>";
string content2= Regex.Replace(content,@"\bPakistan\b", "India");
string content3 = Regex.Replace(content2, @"(?<=\<\s*a[^<]+)\bIndia\b(?=.*?\>)", "Pakistan"); // restore the original word inside the link
Console.WriteLine(content3);

However, this is not a very efficient solution.

Adeel
Maybe not very efficient, but easy to understand and implement. Thanks.
Taz
I used it like this: inputText = Regex.Replace(inputText, @"(?<=\<\s*a[^<]+)\bStringToReplace\b(?=.*?\>)", "DBPT"); inputText = System.Text.RegularExpressions.Regex.Replace(inputText, "(\\bStringtoReplace\\b)", Replacement); inputText = Regex.Replace(inputText, @"(?<=\<\s*a[^<]+)\bDBPT\b(?=.*?\>)", StringtoReplace);
Taz
+1  A: 

For the first part of your question, I would match either a link or the target word:

Regex r = new Regex(@"<a\s+.*?</a>|\bPakistan\b");

Then I would use a MatchEvaluator to check which one I matched and replace accordingly: if it's a link, plug it back in; if it's the target word, linkify it.
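The match-then-decide idea above could be sketched roughly as follows. The class and method names (`LinkifyDemo`, `LinkifyOutsideAnchors`) are made up for illustration, the `Page.aspx?Link=` URL shape is taken from the question, and `RegexOptions.Singleline` is my own assumption so that a link spanning a line break is still consumed as one match:

```csharp
using System;
using System.Text.RegularExpressions;

static class LinkifyDemo
{
    // Hypothetical helper: linkify "Pakistan" only where it is not already inside <a ...>...</a>.
    public static string LinkifyOutsideAnchors(string input)
    {
        // First alternative swallows a whole existing link; second matches the bare word.
        var r = new Regex(@"<a\s+.*?</a>|\bPakistan\b", RegexOptions.Singleline);

        // The lambda is converted to a MatchEvaluator delegate.
        return r.Replace(input, m =>
            m.Value.StartsWith("<a", StringComparison.Ordinal)
                ? m.Value   // existing link: plug it back in unchanged
                : "<a href=\"Page.aspx?Link=" + m.Value + "\">" + m.Value + "</a>");
    }

    static void Main()
    {
        string s = "Pakistan, officially the <a href=\"Page.aspx?Link=Islamic Republic of Pakistan\">Islamic Republic of Pakistan</a>";
        Console.WriteLine(LinkifyOutsideAnchors(s));
    }
}
```

Because the existing link is matched (and therefore consumed) as a unit, the regex engine never gets a chance to match the bare word inside it.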

For the second part, you can Join the strings in the array into a regex alternation, like this:

string regex = String.Format(@"\b({0})\b", String.Join("|", links));

Just remember that an alternation returns the first matching alternative, not the longest. If any alternative A is a prefix of alternative B, B should be listed before A. For example, the Middle East should come before the Middle in your list.
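One way to get that ordering without hand-sorting the array is to sort by length, longest first, since a phrase is always longer than any of its proper prefixes. This is only a sketch: `AlternationDemo` and `BuildPattern` are hypothetical names, and calling `Regex.Escape` on each entry is my assumption that the array holds literal text rather than regex fragments.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

static class AlternationDemo
{
    // Hypothetical helper: build \b(alt1|alt2|...)\b with longer phrases listed first.
    public static string BuildPattern(string[] links)
    {
        var ordered = links
            .OrderByDescending(s => s.Length)   // a longer phrase precedes any of its prefixes
            .Select(s => Regex.Escape(s));      // assumption: entries are literal text
        return String.Format(@"\b({0})\b", String.Join("|", ordered));
    }

    static void Main()
    {
        string[] links = { "the Middle", "Middle East", "the Middle East", "Gulf", "Gulf of Oman" };
        var r = new Regex(BuildPattern(links));
        Console.WriteLine(r.Match("in the Middle East").Value);  // prints "the Middle East"
        Console.WriteLine(r.Match("the Gulf of Oman").Value);    // prints "Gulf of Oman"
    }
}
```

Sorting only settles which alternative wins at a given starting position. Phrases that overlap without sharing a start (such as "Islamic Republic" and "Republic of Pakistan") are still resolved leftmost-first by the engine.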

Alan Moore