I'm not good with regexps, but if all you're doing is replacing external URLs (i.e. ones that aren't part of your site/application) with an internal URL that tracks click-thrus and redirects the user, then it should be easy to construct a regexp that matches only external URLs.
So let's say your domain is foo.com; then you just need a regexp that only matches a hyperlink whose URL doesn't start with http://foo.com. Now, as I said, I'm pretty bad with regexps, but here's my best stab at it:
$reg[0] = '`<a(\s[^>]*)href="(?!http://foo\.com)([^"]*)"([^>]*)>`si';
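As a quick sanity check, here's a minimal sketch that runs the pattern over a few sample links. The sample HTML and the variable names are made up for illustration, and I've escaped the dot in foo.com so it only matches a literal dot.

```php
<?php
// Match <a> tags whose href does NOT start with http://foo.com
$pattern = '`<a(\s[^>]*)href="(?!http://foo\.com)([^"]*)"([^>]*)>`si';

// Hypothetical sample markup: two external links and one internal link.
$html = '<a href="http://bar.com">external</a> '
      . '<a href="http://foo.com/about">internal</a> '
      . '<a class="ext" href="http://last.fm">external</a>';

preg_match_all($pattern, $html, $matches);

// Capture group 2 holds the URL of every link that matched.
$urls = $matches[2];
print_r($urls); // $urls is ['http://bar.com', 'http://last.fm']
```

The internal link is skipped because the negative lookahead fails as soon as the href value starts with http://foo.com.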
Edit:
If you want to track click-thrus to internal URLs as well, then just replace http://foo.com with the URL of your redirect/tracking page, e.g. http://foo.com/out.php.
I'll walk through an example scenario just to show what I'm talking about. Let's say you have the newsletter below:
<h1>Newsletter Name</h1>
<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Donec lobortis,
ligula <a href="http://bar.com">sed sollicitudin</a> dignissim, lacus dolor
suscipit sapien, <a href="http://foo.com">eget auctor</a> ipsum ligula
non tortor. Quisque sagittis sodales elit. Mauris dictum blandit lacus.
Mauris consequat <a href="http://last.fm">laoreet lacus</a>.</p>
For the purpose of this exercise, the search pattern will be:
// Only match links that don't begin with: http://foo.com/out.php
`<a(\s[^>]*)href="(?!http://foo\.com/out\.php)([^"]*)"([^>]*)>`si
This regexp can be broken down into 3 parts:
1. <a(\s[^>]*)href="
2. (?!http://foo\.com/out\.php)([^"]*)
3. "([^>]*)>
On the first pass of the search, the script will examine:
<a href="http://bar.com">
This link satisfies all 3 components of the regexp, so the URL is stored in the database and is replaced with http://foo.com/out.php?id=1.
On the second pass of the search, the script will examine:
<a href="http://foo.com/out.php?id=1">
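This link matches parts 1 and 3, but not part 2, because the negative lookahead rejects any URL that already starts with http://foo.com/out.php. So the search will move on to the next link: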
<a href="http://foo.com">
This link satisfies all 3 components of the regexp, so the URL is stored in the database and is replaced with http://foo.com/out.php?id=2.
On the 3rd pass of the search, the script will examine the first 2 (already replaced) links, skip them, and then find a match with the last link in the newsletter.
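The passes described above could be sketched roughly like this. It's only an illustration under a few assumptions: the $trackedUrls array is a hypothetical stand-in for the database table, the ids are simply counted up from 1, and the newsletter is a shortened copy of the example. Each pass uses preg_replace_callback with a limit of 1, so exactly one not-yet-rewritten link is handled per pass.

```php
<?php
// Only match links that don't begin with: http://foo.com/out.php
$pattern = '`<a(\s[^>]*)href="(?!http://foo\.com/out\.php)([^"]*)"([^>]*)>`si';

// Shortened version of the example newsletter.
$newsletter = <<<'HTML'
<h1>Newsletter Name</h1>
<p>Lorem <a href="http://bar.com">sed sollicitudin</a> ipsum
<a href="http://foo.com">eget auctor</a> dolor
<a href="http://last.fm">laoreet lacus</a>.</p>
HTML;

// Hypothetical stand-in for the database: id => original URL.
$trackedUrls = [];
$nextId = 1;

do {
    // One pass: rewrite the first link that is not already a tracking link.
    $newsletter = preg_replace_callback(
        $pattern,
        function (array $m) use (&$trackedUrls, &$nextId): string {
            $id = $nextId++;
            $trackedUrls[$id] = $m[2]; // "store" the original URL
            // Rebuild the tag, keeping the other attributes untouched.
            return '<a' . $m[1] . 'href="http://foo.com/out.php?id=' . $id . '"' . $m[3] . '>';
        },
        $newsletter,
        1,      // limit: replace a single link per pass
        $count  // set to the number of replacements made in this pass
    );
} while ($count > 0); // stop once a pass finds nothing left to rewrite

print_r($trackedUrls);
echo $newsletter;
```

Dropping the limit argument should give the same result in a single call, since links that were already rewritten never match the pattern again; the loop is only there to mirror the pass-by-pass description.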