I'm trying to put together a plug-in for vBulletin to filter out links to filesharing sites. But, as I'm sure you often hear, I'm new to PHP, let alone regexes.

Basically, I'm trying to put together a regex and use preg_replace to find any URLs that are from these domains and replace the entire link with a message that they aren't allowed. I'd want it to find the link whether it's hyperlinked, posted as plain text, or enclosed in [CODE] BB tags.

As for the regex, I would need it to find URLs with the following, I think:

  1. Starts with http or an anchor tag. I believe that the URLs in [CODE] tags could be processed the same as the plain-text URLs, and it's fine if the replacement ends up inside the [CODE] tag afterward.
  2. Could contain any number of any characters before the domain/word
  3. Has the domain somewhere in the middle
  4. Could contain any number of any characters after the domain
  5. Ends with one of a number of extensions such as (html|htm|rar|zip|001) or in a closing anchor tag.

I have a feeling that it's numbers 2 and 4 that are tripping me up (if not much more). I found a similar question on here and tried to pick apart the code a bit (even though I didn't really understand it). I now have this which I thought might work, but it doesn't:

<?php  
$filterthese = array('domain1', 'domain2', 'domain3');  
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';  
$regex = array('!^http+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*(html|htm|rar|zip|001)$!',  
'!^<a+([a-z0-9-]+\.)*$filterthese+([a-z0-9-]+\.)*</a>$!');
$this->post['message'] = preg_replace($regex, $replacement, $this->post['message']);
?>

I have a feeling that I'm way off base here, and I admit that I don't fully understand php let alone regexes. I'm open to any suggestions on how to do this better, how to just make it work, or links to RTFM (though I've read up a bit and I'm going to continue).

Thanks.

A: 

Hmm, my first guess: You put $filterthese directly inside a single-quoted string. Single quotes don't allow for variable substitution. Also, $filterthese is an array, so it should first be joined:

$filterthese = implode("|", $filterthese);

Maybe I'm way off, because I don't know anything about vBulletin plugins and their embedded magic, but those points seem worth a check to me.
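
To illustrate both points, here is a minimal sketch. The sample text and the deliberately simple pattern are assumptions for illustration only, not the final regex. It shows that a single-quoted string keeps the variable name literally, and that joining the (escaped) names with "|" and concatenating them gives a usable pattern:

```php
<?php
// Placeholder domain names taken from the question.
$filterthese = array('domain1', 'domain2', 'domain3');

// Single quotes: the variable name is kept literally, nothing is substituted.
$literal = '!$filterthese!';

// Escape each name for use in a regex delimited by "!", then join the
// names into one alternation.
$quoted = array();
foreach ($filterthese as $name) {
    $quoted[] = preg_quote($name, '!');
}
$alternation = implode('|', $quoted);

// Concatenation puts the joined names into the pattern.
// Simplified pattern: any run of non-space characters containing a name.
$pattern = '!https?://\S*(' . $alternation . ')\S*!i';

$text   = 'see http://www.domain2.com/file.rar for details';
$result = preg_replace($pattern, 'LINKS HAVE BEEN FILTERED MESSAGE', $text);
```

Here $literal still contains the text "$filterthese", while $pattern contains the alternation domain1|domain2|domain3.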

Edit: OK, on re-checking your provided source, I think the regexp line should read like this:

$regex = '!(?#
  possible "a" tag [start]: )(<a[^>]+href=["\']?)?(?#
  offending link:           )https?://(?#
    possible subdomains:    )([a-z0-9-]+\.)*(?#
    domains to block:       )('.implode("|", $filterthese).')(?#
    possible path:          )(/[^ "\'>]*)?(?#
  possible "a" tag [end]:   )(["\']?[^>]*>)?!i';
Boldewyn
A: 

You can use parse_url on the URLs and look into the associative array it returns (scheme, host, path and so on). That lets you filter by domain, or apply even finer-grained checks.
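
As a rough sketch of that idea (the function name is_blocked_url and the blocklist entries are made up for this example), you could compare the host component against a list of full domain names:

```php
<?php
// Hypothetical blocklist; the host names are placeholders.
$blocked = array('domain1.com', 'domain2.net');

// Returns true if the URL's host is a blocked domain or a subdomain of one.
function is_blocked_url($url, array $blocked)
{
    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host)) {
        return false; // no host component, or the URL could not be parsed
    }
    $host = strtolower($host);
    foreach ($blocked as $domain) {
        // Exact match, or the host ends with ".<domain>".
        if ($host === $domain
            || substr($host, -strlen($domain) - 1) === '.' . $domain) {
            return true;
        }
    }
    return false;
}
```

Because the comparison is anchored at the end of the host name, something like another-domain1.com or domain1.com.company.com is not treated as blocked.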

Joey
A: 

I think you can avoid the overhead of this by using the built-in filter_var function.

It has been available since PHP 5.2.0.

$good_url = filter_var( filter_var( $raw_url, FILTER_SANITIZE_URL), FILTER_VALIDATE_URL);
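
A small sketch of what that call gives you (the wrapper name clean_url and the sample inputs are assumptions for illustration). Note that this only tells you whether a string is a well-formed URL; deciding whether its domain is one of the filtered ones is still a separate step:

```php
<?php
// Returns the sanitized URL if it is well-formed, or false otherwise.
function clean_url($raw_url)
{
    // Strip characters that are not allowed in a URL (spaces, for example).
    $sanitized = filter_var($raw_url, FILTER_SANITIZE_URL);
    // Returns the URL string on success, false on failure.
    return filter_var($sanitized, FILTER_VALIDATE_URL);
}

$ok  = clean_url('http://www.domain1.com/file.zip');
$bad = clean_url('just some text');
```
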
Arno
A: 

Break down your problem into two steps. First, find (somehow) all the URLs present in the post. Second, match each URL against the blacklisted domains. Assuming that you have:

$blacklist = array(
    'domain1.com',
    'domain2.net',
    'domain3.co.uk'
);

the code might go something along these lines:

foreach ( $urls as $url )
{
  foreach ( $blacklist as $ban )
  {
    // escape regex metacharacters: domain1.com becomes domain1\.com
    $ban = preg_quote( $ban, '@' );
    // reg-exp becomes ^https?://([^/.]+\.)*domain1\.com(/|$)
    // ([^/.]+ keeps the subdomain part from running into the path)
    // should match http://domain1.com
    // should match http://domain1.com/
    // should match https://domain1.com
    // should match http://www.domain1.com
    // should match http://www.download.domain1.com
    // should match http://www.download.server1.domain1.com
    // should not match http://another-domain1.com
    // should not match http://domain1.company.com
    // should not match http://domain1.com.company.com
    if ( preg_match( '@^https?://([^/.]+\.)*' . $ban . '(/|$)@i', $url ) )
    {
      // banned url
    }
  }
}
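
For the first step, one possible sketch (an assumption on my part, not the only way to do it) is to let preg_replace_callback find anything that looks like an http(s) URL and decide per match whether to replace it. The function names is_banned and filter_links are made up for this example, and the URL pattern is deliberately rough: it stops at whitespace, quotes, angle brackets and square brackets so that URLs inside [CODE] tags are picked up as well.

```php
<?php
// Hypothetical blacklist and replacement text.
$blacklist   = array('domain1.com', 'domain2.net', 'domain3.co.uk');
$replacement = 'LINKS HAVE BEEN FILTERED MESSAGE';

// Step 2: true if $url points at a blacklisted domain or one of its subdomains.
function is_banned($url, array $blacklist)
{
    foreach ($blacklist as $ban) {
        $pattern = '@^https?://([a-z0-9-]+\.)*' . preg_quote($ban, '@') . '(/|$)@i';
        if (preg_match($pattern, $url)) {
            return true;
        }
    }
    return false;
}

// Step 1: find every http(s) URL in the message and replace the banned ones.
function filter_links($message, array $blacklist, $replacement)
{
    return preg_replace_callback(
        '@https?://[^\s"\'<>\[\]]+@i',
        function ($m) use ($blacklist, $replacement) {
            return is_banned($m[0], $blacklist) ? $replacement : $m[0];
        },
        $message
    );
}
```

In the plug-in this could be called as filter_links($this->post['message'], $blacklist, $replacement). It only replaces the URL text itself; for a hyperlinked URL the surrounding anchor tag would stay in place, so removing the whole tag would need extra handling.
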
Salman A