Hi, I have HTML in the form of a string, and before I display it in the browser I want to change all the relative URLs on the page to absolute URLs. What is the best way to do this? I was thinking of using a regex to find the href attributes of anchor tags and prepend the base URL to them, but I am not sure how to do it. Can someone help or suggest a better solution?

PS: I want to exclude all links that consist only of a "#" fragment. For example, I want to replace <a href="/dir/file1.htm" /> with <a href="http://mysite/dir/file1.htm" /> but leave <a href="#A1" /> unchanged.

I would appreciate any help on this.

+3  A: 

In general, using RegEx to parse HTML is a bad idea - see here for why.

You can use an HTML parser like the HTML Agility Pack in order to extract URLs from HTML:

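// Requires the HtmlAgilityPack package (using HtmlAgilityPack;)
HtmlDocument doc = new HtmlDocument();
doc.Load("file.htm"); // use doc.LoadHtml(html) if the HTML is already in a string
// SelectNodes returns null when no node matches the XPath
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a[@href]");
if (links != null)
{
   foreach (HtmlNode link in links)
   {
      HtmlAttribute att = link.Attributes["href"];
      att.Value = FixLink(att); // FixLink is your own helper that builds the absolute URL
   }
}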

You can then exclude any URLs that start with #.
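
The answer leaves FixLink undefined, so here is a minimal sketch of what such a helper might look like. It is an assumption rather than part of the original answer: the LinkFixer class name, the string-based signature, and the baseUrl parameter are all hypothetical. It uses only System.Uri to resolve the link and returns fragment-only links such as "#A1" unchanged.

```csharp
using System;

static class LinkFixer
{
    // Hypothetical helper (not from the original answer): resolves href
    // against baseUrl and returns the absolute URL as a string.
    public static string FixLink(string href, string baseUrl)
    {
        // Leave empty values and fragment-only links ("#A1") untouched.
        if (string.IsNullOrEmpty(href) || href.StartsWith("#"))
            return href;

        // Uri.TryCreate(Uri, string, out Uri) resolves a relative reference
        // against the base; an href that is already absolute keeps its own host.
        if (Uri.TryCreate(new Uri(baseUrl), href, out Uri absolute))
            return absolute.ToString();

        // If the value cannot be resolved, return it unchanged.
        return href;
    }
}
```

Inside the loop this could be called as att.Value = LinkFixer.FixLink(att.Value, "http://mysite/"); with the base URL from the question. Resolving through Uri rather than string concatenation means links like "../file.htm" and already absolute links should also be handled without special cases.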

Oded
Thanks for the answer. I had heard about this but didn't know it could load streams as well until I downloaded it. I was going to give Regex a try, but I have dropped that idea since this is so easy to implement.
Sridhar