I have blog data like:

This is foreign <a href="xyz.com">link</a>, this is my site's <a href="mysite.com">link</a> and so on.

What I want to do is filter out the links to foreign sites, i.e. "<a href="xyz.com">link</a>", so that my final output is:

This is foreign link, this is my site's <a href="mysite.com">link</a> and so on.

I tried "preg_replace" but no pattern helped.

+1  A: 

This shouldn't be done with regular expressions.

Try something like a DOM parser.

I don't know if you're using PHP, but this one is very easy to use:
http://simplehtmldom.sourceforge.net/

Hope this helps.

macek
Thanks for sharing. This solved my problem
Chetan sharma
@Chetan sharma, then you should mark this as the "accepted" answer :)
macek
@smotchkkiss, Oh I'm sorry, Thanks for telling.
Chetan sharma
+1  A: 

You can use DOMDocument to find all link elements and update the source that way. I wrote a little example of how to use DOMDocument to find all links. I use this method to rewrite links in some projects I've worked on. I'm sure it wouldn't take much effort to go further and replace the <a> tag with its text if the URL does not match your host.
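As a rough illustration of that idea (not the example referred to above), here is a minimal sketch using PHP's built-in DOMDocument. The function name stripForeignLinks() and the $ownHost parameter are made up for this sketch. It assumes every href starts with a host name, optionally preceded by a scheme, as in the question's sample data.

```php
<?php
// Sketch only: stripForeignLinks() and $ownHost are hypothetical names.
// Assumes each href starts with a host, optionally preceded by a scheme.
function stripForeignLinks(string $html, string $ownHost): string
{
    libxml_use_internal_errors(true);   // keep parser warnings out of the output
    $doc = new DOMDocument();
    // The XML prolog hints UTF-8; the wrapper <div> lets us extract the fragment later.
    $doc->loadHTML('<?xml encoding="UTF-8"><div>' . $html . '</div>');
    libxml_clear_errors();

    // Snapshot the live node list, since the loop below modifies the tree.
    foreach (iterator_to_array($doc->getElementsByTagName('a')) as $a) {
        $href = $a->getAttribute('href');
        // parse_url() only reports a host when a scheme is present, so add one if missing.
        if (!preg_match('~^[a-z][a-z0-9+.-]*://~i', $href)) {
            $href = 'http://' . $href;
        }
        $host = parse_url($href, PHP_URL_HOST);
        if (is_string($host) && strcasecmp($host, $ownHost) === 0) {
            continue;                   // our own link: leave it untouched
        }
        // Move the link's children in front of it, then drop the empty <a>.
        while ($a->firstChild) {
            $a->parentNode->insertBefore($a->firstChild, $a);
        }
        $a->parentNode->removeChild($a);
    }

    // Serialize only the wrapper's children, not the <html>/<body> the parser added.
    $out = '';
    foreach ($doc->getElementsByTagName('div')->item(0)->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling stripForeignLinks($blogHtml, 'mysite.com') on the sample from the question should leave the mysite.com link in place and reduce the xyz.com link to its text. Relative links such as "/about" are not handled by this sketch and would need an extra check.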

Eric Butera
+1  A: 

First of all, I have to agree with the people who've already said that regexes are not the right tool for HTML.

That said, if what you want to do is no more complex than replacing any and all occurrences of

<a href="something.tld">foo</a>

with

foo

if something.tld is not your domain, then this should do the trick

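// Unwrap any <a> whose href does not point at mysite.com (scheme optional).
$mystring = preg_replace( '/<a href="(?!(?:https?:\/\/)?mysite\.com["\/])[^"]*"[^>]*>(.*?)<\/a>/i',
                          '$1',
                          $mystring );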

where $mystring is obviously the string you'd like to modify. However, this uses regex lookarounds, a pretty good giveaway that this was not meant to be done with regexes.

HTH

Thomas PARIS
A: 

Thanks all, I really appreciate your help. It helped me a lot to increase my knowledge.

Chetan sharma
A: 

I would strongly encourage you to use http://htmlpurifier.org/, which will not only make it easy to write a link filter ( http://htmlpurifier.org/docs/enduser-uri-filter.html ) but also protect you from XSS attacks. If you aren't using a whitelist-based HTML filter, you need to treat user-supplied data as literal text and escape HTML special characters.
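For the second case, treating user-supplied data as literal text, a minimal sketch with PHP's built-in htmlspecialchars() could look like the following. The helper name escapeForHtml() is made up for illustration, and UTF-8 input is an assumption.

```php
<?php
// Hypothetical helper name; htmlspecialchars() itself is a PHP built-in.
// ENT_QUOTES escapes both double and single quotes; UTF-8 input is assumed.
function escapeForHtml(string $userText): string
{
    return htmlspecialchars($userText, ENT_QUOTES, 'UTF-8');
}

// The markup is shown to the reader as text instead of being rendered as a link.
echo escapeForHtml('<a href="xyz.com">link</a>'), PHP_EOL;
// prints: &lt;a href=&quot;xyz.com&quot;&gt;link&lt;/a&gt;
```

This escapes everything, so it removes your own site's links as well. It suits fields where no HTML should be allowed at all.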

bluej100
Thanks for sharing.
Chetan sharma