I need to cache 10 websites. In the cached pages, photos, CSS, JS, etc. are not displayed properly because the base domain isn't attached to the directory path. I need a regex that adds the base domain to the directory. Examples below.

base domain: http://www.example.com

The problem occurs when reading cached pages that contain img src="thumb/123.jpg" or src="/inc/123.js".

They would display correctly if these were img src="http://www.example.com/thumb/123.jpg" or src="http://www.example.com/inc/123.js".

The regex should do something like this: if src=" isn't followed by the base domain, then add the base domain.

A: 

Matching regular expression:

(?:src|href)="(http://www\.example\.com/)?.+
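As a rough illustration of how a match-only pattern like this can still drive a replacement, here is a minimal Python sketch (Python is my choice for illustration, not the asker's language). It assumes the base domain is http://www.example.com/ with a trailing slash; the names BASE, ATTR, add_base and absolutize are hypothetical. The optional group records whether the domain is already present, and a callback decides whether to prepend it. Other absolute URLs (a different host, or protocol-relative // paths) are not handled here.

```python
import re

BASE = "http://www.example.com/"  # assumed base domain, with trailing slash

# Same idea as the pattern above: the optional group captures the base domain
# when it is already present. The value is read up to the closing quote, and
# one leading "/" is skipped so the result has no double slash.
ATTR = re.compile(
    r'((?:src|href)=")(' + re.escape(BASE) + r')?/?([^"]*)"',
    re.IGNORECASE,
)

def add_base(match):
    prefix, domain, path = match.groups()
    if domain:
        # Domain already present: leave the attribute untouched.
        return match.group(0)
    return f'{prefix}{BASE}{path}"'

def absolutize(html):
    # re.sub replaces every match in the string, not just the first one.
    return ATTR.sub(add_base, html)
```

Calling absolutize on the whole cached page rewrites every src and href in a single pass.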
Delan Azabani
Can't get this to work. I tried strDomain = http://www.example.com and RegEx.Pattern = "(?:src|href)=chr(34)(strDomain)?.+" and when I tried strHTMLCode = RegEx.Replace(strHTMLCode) I got an error.
This one doesn't replace, it just matches. If there's a RegEx.Match() method, this should return true for any src or href attribute in an XHTML document.
Tim
OK, I solved the problem by using the base reference tag. See w3schools.com: http://www.w3schools.com/tags/tag_base.asp. Thanks, everyone, for your help.
Well, that's an approach that just went right past me, lol. Sure, take the easy way out ;o)
Tim
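The base-tag approach mentioned in the comments above can be sketched as follows. This is a minimal Python sketch, not the asker's actual code: BASE and insert_base_tag are hypothetical names, and it assumes each cached page has an opening head tag. With a base element in place, the browser resolves relative src and href values against that URL, so the attributes themselves need no rewriting.

```python
import re

BASE = "http://www.example.com/"  # assumed base URL of the cached site

def insert_base_tag(html, base=BASE):
    """Insert a <base href="..."> element right after the first opening <head> tag."""
    tag = f'<base href="{base}">'
    # \b keeps <header> from matching; count=1 touches only the first <head>.
    return re.sub(
        r'(<head\b[^>]*>)',
        lambda m: m.group(1) + tag,
        html,
        count=1,
        flags=re.IGNORECASE,
    )
```

If a page has no head tag, the function returns the HTML unchanged.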
+2  A: 

Without knowing the language, you can use the (maybe most portable) Perl-style substitution operator:

s/(src=")([^"]+")/$1http:\/\/www.example.com\/$2/g

This should do the following: 1. match the string 'src="' (and capture it in variable $1); 2. match one or more non-double-quote characters followed by " (and capture them in variable $2); 3. substitute 'http://www.example.com/' in between the two capture groups. The g modifier applies this to every match rather than only the first.

Depending on the language, you can wrap this in a conditional that checks for the existence of the domain and substitutes if it isn't found.

To check for the domain, /www\.example\.com/i should do.

EDIT: See comments:

For PHP, I would do this a bit differently. I would probably use simplexml. I don't think that will translate well, though, so here's a regex one...

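$html = file_get_contents('/path/to/file.html');
// Match src="..." or href="..." values that do not already start with the domain.
$regex_match = '/(src="|href=")(?!http:\/\/www\.example\.com\/)\/?([^"]+")/i';
$regex_substitute = '$1http://www.example.com/$2';
// preg_replace returns the new string; it does not modify $html in place.
$html = preg_replace($regex_match, $regex_substitute, $html);

Note: I haven't actually run this to debug it; it's just off the cuff. I would be concerned about three things. First, the / character: it is the pattern delimiter here, so every literal / inside the pattern has to be escaped as \/. I don't think you're concerned with this, though, unless VB has a similar problem. Second, if there's a chance that line breaks would get in the way, I might change the regex. Third, I added the (?!http:\/\/www\.example\.com\/) bit. This negative lookahead restricts the match to any src or href that doesn't already have http://www.example.com/ there, but it depends on the type of regex being used (it needs PCRE; POSIX regexes have no lookahead).

The rest of the changes should be fine: I added href=", made it case-insensitive (the i modifier), and skipped a leading / in the path so the result doesn't end up with a double slash. In Perl-style s/// you also need the g modifier to make it global, otherwise it will just match once; preg_replace replaces every match by default and rejects g as an unknown modifier.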
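For readers converting this to another language, the same negative-lookahead idea can be sketched in Python. This is an illustration under assumptions, not a tested port of the PHP above: BASE, PATTERN and add_base_domain are hypothetical names, and only the one base domain is recognised as "already absolute", so URLs on other hosts would still get the prefix.

```python
import re

BASE = "http://www.example.com/"  # assumed base domain, with trailing slash

# (?!...) is a negative lookahead: the match is allowed only when the
# attribute value does NOT already start with BASE. One leading "/" is
# consumed outside the capture group to avoid a double slash.
PATTERN = re.compile(
    r'((?:src|href)=")(?!' + re.escape(BASE) + r')/?([^"]+")',
    re.IGNORECASE,
)

def add_base_domain(html):
    # A function replacement avoids any escaping surprises in BASE;
    # re.sub is global by default, like s///g in Perl.
    return PATTERN.sub(lambda m: m.group(1) + BASE + m.group(2), html)
```

Because [^"]+ also matches newlines, an attribute value broken across lines is still handled.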

I hope that helps.

Tim
How would I set this up so it alters all the HTML at once? I'm using VBScript for this (don't ask). strHTML = all the cached HTML code; strDomain = domain name; Set RegEx = New RegExp; RegEx.Pattern = "s/^(src=")([^"]+")$/$1strDomain\/$2/"; RegEx.Multiline = True; RegEx.Global = True; newstrHTML = RegEx.Replace(strHTML). How do I set up the regex in VBScript to substitute the domain only if it's not present in the directory? I'm not very good at regex at all. TIA
I'll be honest, I have never used vb. Also, I'm having trouble "seeing" the code, can you edit your question with a code block to see it better? One more thing, I would add the trailing / to the strDomain variable (if I'm reading that correctly). Then you won't have any weird escaping needs.
Tim
I guess we can't use line breaks in the comment section. I'll throw up a plain text file on my website so you can see what I'm talking about: http://www.genxts.com/regex.txt
I'm afraid that my help stops at the basic regex. However, if vb really is that easy to understand, then this should work. The question in my mind is what RegEx.Replace() actually does. If it simply overwrites the supplied parameter, then I see it working. If it does something else, then I am not sure... I can give you a PHP or Perl version...
Tim
A PHP version would be great. I can usually convert between the two. TIA