I need to replace the domain name in all links on a page, except links that point to images or PDF files. The input is a full HTML page received through a proxy service.

Example:
<a href="http://www.test.com/bla/bla">test</a><a href="/bla/bla"><img src="http://www.test.com" /><a href="http://www.test.com/test.pdf">pdf</a>
<a href="http://www.test.com/bla/bla/bla">test1</a>

Result:
<a href="http://www.newdomain.com/bla/bla">test</a><a href="/bla/bla"><img src="http://www.test.com" /><a href="http://www.test.com/test.pdf">pdf</a>
<a href="http://www.newdomain.com/bla/bla/bla">test1</a>
+2  A: 

If you are using .NET, I strongly suggest using the HTML Agility Pack. Parsing HTML directly with regex can be very error-prone. This question is also similar to the post below.

http://stackoverflow.com/questions/2438267/what-regex-should-i-use-to-remove-links-from-html-code-in-c/2438292#2438292

Fadrian Sudaman
Not using .NET, js/php
A: 

If the domain is http://www.example.com, the following should do the trick:

/http:\/\/www\.example\.com\S*(?!pdf|jpg|png|gif)\s/

This uses a negative lookahead so that the regex matches only if pdf, png, jpg or gif does not appear at the specified position.

Crimson
that did not work :(
Remove the trailing \s if your links do not end in a whitespace. Use this: /http:\/\/www\.example\.com\S*(?!pdf|jpg|png|gif)/
Crimson
tried running that against an example above, it replaces all the urls
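
The lookahead in the patterns above never rejects anything because `\S*` is greedy: it consumes the rest of the URL, including any `.pdf` ending, so the `(?!pdf|jpg|png|gif)` check runs against whatever follows the URL rather than against its extension. A minimal sketch of one way around this, in JavaScript, puts the lookahead right after the domain and lets it scan ahead to the closing quote. The names `replaceLinkDomain` and `escapeRegExp`, and the assumption that `href` values are double-quoted, are illustrative and not part of the original answer.

```javascript
// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: swap oldDomain for newDomain in <a href="..."> values,
// skipping links whose URL ends in a PDF or image extension.
// Assumes double-quoted href attributes; img src attributes are never touched
// because the pattern only starts inside an <a ...> tag.
function replaceLinkDomain(html, oldDomain, newDomain) {
  var pattern = new RegExp(
    '(<a\\s[^>]*?href=")' +                // anchor tag up to the start of the URL
    escapeRegExp(oldDomain) +              // the domain to replace
    '(?![^"]*\\.(?:pdf|jpe?g|png|gif)")',  // reject URLs ending in a skipped extension
    "gi"
  );
  // A replacer function avoids "$" in newDomain being treated specially.
  return html.replace(pattern, function (match, prefix) {
    return prefix + newDomain;
  });
}

var input =
  '<a href="http://www.test.com/bla/bla">test</a>' +
  '<a href="/bla/bla"><img src="http://www.test.com" /></a>' +
  '<a href="http://www.test.com/test.pdf">pdf</a>';

console.log(replaceLinkDomain(input, "http://www.test.com", "http://www.newdomain.com"));
```

This pattern still treats a URL such as `a.pdf?x=1` as a normal link, since it only looks for the extension directly before the closing quote.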
A: 

If none of your PDF URLs have query parameters (like a.pdf?asd=12), the following code will work. It rewrites only absolute and root-relative URLs; plain relative URLs are left unchanged.

var links = document.getElementsByTagName("a");
var len = links.length;
var newDomain = "http://mydomain.com";
/**
 * Match absolute urls (starting with http) 
 * and root relative urls (starting with a `/`)
 * Does not match relative urls like "subfolder/anotherpage.html"
 * */
var regex = new RegExp("^(?:https?://[^/]+)?(/.*)$", "i");
//uncomment next line if you want to replace only absolute urls
//regex = new RegExp("^https?://[^/]+(/.*)$", "i");
for(var i = 0; i < len; i++)
{
  var link = links.item(i);
  var href = link.getAttribute("href");
  if(!href) //in case of named anchors
    continue;
  if(href.match(/\.(?:pdf|jpe?g|png|gif)$/i)) //skip links to pdf or image files
    continue;
  href = href.replace(regex, newDomain + "$1");
  link.setAttribute("href", href);
}
Amarghosh
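
The query-parameter caveat above can be avoided by checking the extension on the URL's path only. The sketch below does that with the standard `URL` class, which is available in browsers and Node.js. The function `rewriteHref`, its parameters, and the extension list are assumptions for illustration. It leaves relative and root-relative hrefs untouched, which matches the expected result in the question rather than the default behavior of the loop above.

```javascript
// Extensions that should keep their original domain (assumed list).
var SKIPPED = /\.(?:pdf|jpe?g|png|gif)$/i;

// Hypothetical helper: return href with its origin replaced by newOrigin,
// unless it is not an absolute URL on oldHost or its path ends in a skipped extension.
function rewriteHref(href, oldHost, newOrigin) {
  var url;
  try {
    url = new URL(href);        // throws for relative or root-relative hrefs
  } catch (e) {
    return href;                // leave non-absolute hrefs unchanged
  }
  if (url.hostname !== oldHost) return href;
  if (SKIPPED.test(url.pathname)) return href;  // pathname excludes ?query and #hash
  return newOrigin + url.pathname + url.search + url.hash;
}

console.log(rewriteHref("http://www.test.com/bla/bla", "www.test.com", "http://www.newdomain.com"));
console.log(rewriteHref("http://www.test.com/a.pdf?asd=12", "www.test.com", "http://www.newdomain.com"));
```

Inside the loop this could be used as `link.setAttribute("href", rewriteHref(href, "www.test.com", newDomain))` in place of the regex replacement.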