Hi, I'm trying to find all the anchor tags and append a variable to each href value. For example:

<a href="/page.aspx">link</a> will become <a href="/page.aspx?id=2">
<A hRef='http://www.google.com'><img src='pic.jpg'></a> will become <A hRef='http://www.google.com?id=2'><img src='pic.jpg'></a>

I'm able to match all the anchor tags and href values using regex, and then I manually replace the values using string.replace. However, I don't think that's an efficient way to do this. Is there a solution where I can use something like regex.replace(html,newurlvalue)?

+2  A: 

Yes, you can. The standard warning applies -- regular expressions are not sufficiently powerful to reliably parse HTML. In other words, it may work for you in the most straightforward and controlled cases, but there are many cases where it will fail.

However, if you already have the regular expression written then paste it into Regex Hero along with your HTML, click the "Replace" tab and type in your replacement string.

Once you've verified that it's working click Tools > Generate .NET Code and you'll have your answer.

UPDATE: So here's an imperfect example of this in action using groups:

// Match the URL inside a double-quoted href attribute and capture it as "url".
string strRegex = @"(?<=href="")(?<url>[^""]+)(?="")";
RegexOptions myRegexOptions = RegexOptions.IgnoreCase;
Regex myRegex = new Regex(strRegex, myRegexOptions);
string strTargetString = @"<a href=""/page.aspx"">link</a> will become <a href=""/page.aspx?id=2"">" + (char)10 + "<A hRef='http://www.google.com'><img src='pic.jpg'></a> will become <A hRef='http://www.google.com?id=2'><img src='pic.jpg'></a>";
// ${url} inserts the captured URL into the replacement text.
string strReplace = "http://mysite.com${url}";

// Assumes this snippet sits inside a method that returns a string.
return myRegex.Replace(strTargetString, strReplace);

http://regexhero.net/tester/?id=e993fbf1-acb7-4f59-af87-94253a6e8221

The (?<url>[^"]+) part is a named group that can be referenced in the replacement string as ${url}.
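The example above prefixes each URL, while the question asks for a value to be appended. As a minimal sketch of that variation, assuming the same double-quoted pattern, the named group can be followed by the suffix in the replacement string. The class name HrefAppendSketch, the method AppendToHrefs and its suffix parameter are hypothetical names chosen for illustration; single-quoted and unquoted href values are still not matched here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HrefAppendSketch
{
    // Same double-quoted href pattern as in the answer above.
    private static readonly Regex HrefPattern = new Regex(
        @"(?<=href="")(?<url>[^""]+)(?="")",
        RegexOptions.IgnoreCase);

    // "suffix" is a hypothetical parameter, for example "?id=2".
    public static string AppendToHrefs(string html, string suffix)
    {
        // Double any '$' so the suffix is taken literally in the replacement pattern.
        string literalSuffix = suffix.Replace("$", "$$");

        // ${url} re-inserts the captured URL; the suffix follows it.
        return HrefPattern.Replace(html, "${url}" + literalSuffix);
    }

    public static void Main()
    {
        // Prints: <a href="/page.aspx?id=2">link</a>
        Console.WriteLine(AppendToHrefs(@"<a href=""/page.aspx"">link</a>", "?id=2"));
    }
}
```

Doubling the dollar sign only matters if the appended value could itself contain a '$'; for a fixed suffix such as ?id=2 the replacement string "${url}?id=2" would do the same job.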

UPDATE #2:

So to match only the URLs without a question mark, you'd do this:

(?<=href=")(?![^"]*\?)(?<url>[^"]+)(?=")

The (?![^"]*\?) part is a negative lookahead that does the trick.
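To see the negative lookahead at work, here is a small sketch that applies the pattern above to two links, one with a query string and one without. The class name LookaheadSketch, the method AppendId and the sample URLs are made up for illustration; only double-quoted href values are handled, as in the pattern itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadSketch
{
    // (?![^""]*\?) rejects any URL that contains a '?' before its closing quote.
    private static readonly Regex NoQueryHref = new Regex(
        @"(?<=href="")(?![^""]*\?)(?<url>[^""]+)(?="")",
        RegexOptions.IgnoreCase);

    public static string AppendId(string html)
    {
        // Only URLs without an existing query string receive the suffix.
        return NoQueryHref.Replace(html, "${url}?id=2");
    }

    public static void Main()
    {
        string html = @"<a href=""/page.aspx"">a</a> <a href=""/list.aspx?sort=1"">b</a>";

        // Prints: <a href="/page.aspx?id=2">a</a> <a href="/list.aspx?sort=1">b</a>
        Console.WriteLine(AppendId(html));
    }
}
```

Because [^"]* in the lookahead stops at the closing quote, a question mark in a later link does not prevent an earlier one from matching.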

Steve Wortham
ace
Steve Wortham
Sorry, this is embarrassing; my regex skills are really bad. The pattern is not matching URLs enclosed in single quotes, so it won't match <A hRef='http://www.google.com'>. I want it to match single quotes, double quotes, and even no quotes at all, so <a href=http://www.google.com> should match too.
ace
@ace - Well, this is doable when you have well-formed XHTML, but matching an href without any quotes at all is where the regex approach really starts to break down. I would highly recommend the HTML Agility Pack in this scenario.
Steve Wortham
By the way, regular expressions were designed to parse regular languages, which are a couple of levels below HTML in terms of complexity. I think HTML would be considered a context-sensitive language as listed in Chomsky's hierarchy, which is why it's best to use a specialized HTML parser: http://en.wikipedia.org/wiki/Chomsky_hierarchy
Steve Wortham
Steve, I understand. However, using an HTML parser like the Agility Pack is currently not an option on our deployment server, which is why I'm looking at regex. Can you please help me fix the regex pattern you gave so that all three forms are matched?
ace
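For the three forms asked about in the comments above (double quotes, single quotes, no quotes), one possible sketch captures the optional quote character in its own group and requires the same character to close the value. This is not a pattern from the thread; the class name AnyQuoteHrefSketch, the method AppendId and the group names pre and q are assumptions made for illustration. It carries the limitations already mentioned: it will also rewrite href-like text outside anchor tags, and it does not match values that contain spaces or the other quote character.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnyQuoteHrefSketch
{
    // pre: "href=" with optional whitespace around '='
    // q:   an optional opening quote (double, single, or empty)
    // url: the value, up to a quote, whitespace or '>'
    // \k<q> requires the same quote (or nothing) to close the value.
    private static readonly Regex AnyQuoteHref = new Regex(
        @"(?<pre>\bhref\s*=\s*)(?<q>[""']?)(?<url>[^""'\s>]+)\k<q>",
        RegexOptions.IgnoreCase);

    public static string AppendId(string html)
    {
        // Rebuild the attribute with its original quoting and the suffix appended.
        return AnyQuoteHref.Replace(html, "${pre}${q}${url}?id=2${q}");
    }

    public static void Main()
    {
        Console.WriteLine(AppendId("<a href=\"/page.aspx\">link</a>"));
        Console.WriteLine(AppendId("<A hRef='http://www.google.com'><img src='pic.jpg'></a>"));
        Console.WriteLine(AppendId("<a href=http://www.google.com>"));
    }
}
```

The negative lookahead from the answer is left out here; adding it would need a version for each quoting style, which is part of why a real HTML parser is the safer choice when one is available.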
+2  A: 

If you're parsing HTML with regex, the standard advice is to use the HTML Agility Pack instead.

Jay Riggs
A: 

Have you looked into using jQuery for this?

senloe
I'm looking to do the manipulation on the server side.
ace