I've tried writing this regular expression myself and searched online, but without success.

I need to validate that a given URL is from a specific domain and a well-formed link (in PHP). For example:

Good Domain: example.com

So good URLs from example.com:

So bad URLs not from example.com:

Some notes: I don't care about "http" versus "https", but if it matters to you, assume "http" always. The code that will use this regex is PHP, so extra points for that.

UPDATE 2010:

Gruber adds a great URL regex:

(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))

See his post: An Improved Liberal, Accurate Regex Pattern for Matching URLs
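For anyone wanting to try that pattern from PHP, here is a minimal sketch. Note that it finds URL-like strings in free text; it does not check the domain. The constant `GRUBER_URL_PATTERN` and the function `extract_urls` are names made up for this example. The pattern is written with its inline `(?i)` flag, wrapped in `~` delimiters (a character the pattern does not use), and given the `u` modifier on the assumption that the source file is saved as UTF-8, since the last character class contains non-ASCII quotes.

```php
<?php
// Nowdoc avoids escaping: the pattern contains both quote characters and backslashes.
const GRUBER_URL_PATTERN = <<<'REGEX'
~(?i)\b((?:[a-z][\w-]+:(?:/{1,3}|[a-z0-9%])|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:'".,<>?«»“”‘’]))~u
REGEX;

// Hypothetical helper: returns every substring of $text that the pattern matches.
function extract_urls(string $text): array
{
    preg_match_all(GRUBER_URL_PATTERN, $text, $matches);
    return $matches[0]; // index 0 holds the full matches
}

$text = 'Docs at http://blah.com/so/this/is/good/index.html, and (see www.blah.com/wow.php?search=doozy).';
print_r(extract_urls($text));
```

Trailing punctuation such as the comma and the closing parenthesis is left out of the matches, which is the main thing this pattern adds over a naive one.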

A: 
\b(https?)://([-A-Z0-9]+\.)*blah.com(/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[A-Z0-9+&@#/%=~_|!:,.;]*)?
Jeremy Stein
I think that would allow http://blah.com.evil.domain (assuming the A-Z is A-Za-z)
Douglas Leeder
The comment system stripped the h-t-t-p-: from my previous example...
Douglas Leeder
A: 
!^https?://(?:[a-zA-Z0-9-]+\.)*blah\.com(?:/[^#]*(?:#[^#]+)?)?$!
chaos
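To make the difference between the two patterns above concrete, here is a small comparison sketch. The variable names are made up for this example. The first pattern is wrapped in `{}` delimiters with the `i` flag (assuming, as the comment suggests, that `A-Z` was meant to be case-insensitive), because it already contains most of the usual delimiter characters. The second is used exactly as posted.

```php
<?php
// First answer: no ^...$ anchors and an unescaped dot in "blah.com".
$unanchored = '{\b(https?)://([-A-Z0-9]+\.)*blah.com(/[-A-Z0-9+&@#/%=~_|!:,.;]*)?(\?[A-Z0-9+&@#/%=~_|!:,.;]*)?}i';
// Second answer: anchored at both ends, dot escaped.
$anchored = '!^https?://(?:[a-zA-Z0-9-]+\.)*blah\.com(?:/[^#]*(?:#[^#]+)?)?$!';

$urls = [
    'http://www.blah.com/so/this/is/good/mice.html#anchortag',
    'http://blah.com.evil.domain', // the case raised in the comment
    'http://blah.com?search=doozy', // query string with no path
];

foreach ($urls as $url) {
    printf(
        "%-58s unanchored=%d anchored=%d\n",
        $url,
        preg_match($unanchored, $url),
        preg_match($anchored, $url)
    );
}
```

The unanchored pattern accepts `http://blah.com.evil.domain` because it only needs to match a prefix of the string. The anchored one rejects it, but it also rejects a query string that follows the host without a `/`, which may or may not matter for your input.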
+2  A: 

My stab at it

<?php

$pattern = "#^https?://([a-z0-9-]+\.)*blah\.com(/.*)?$#";

$tests = array(
    'http://blah.com/so/this/is/good'
  , 'http://blah.com/so/this/is/good/index.html'
  , 'http://www.blah.com/so/this/is/good/mice.html#anchortag'
  , 'http://anysubdomain.blah.com/so/this/is/good/wow.php'
  , 'http://anysubdomain.blah.com/so/this/is/good/wow.php?search=doozy'
  , 'http://any.sub-domain.blah.com/so/this/is/good/wow.php?search=doozy' // I added this case
  , 'http://999.sub-domain.blah.com/so/this/is/good/wow.php?search=doozy' // I added this case
  , 'http://obviousexample.com'
  , 'http://bbc.co.uk/blah.com/whatever/you/get/the/idea'
  , 'http://blah.com.example'
  , 'not/even/a/blah.com/url'
);

foreach ( $tests as $test )
{
  if ( preg_match( $pattern, $test ) )
  {
    echo $test, " <strong>matched!</strong><br>";
  } else {
    echo $test, " <strong>did not match.</strong><br>";
  }
}

//  Here's another way
echo '<hr>';
foreach ( $tests as $test )
{
  if ( $filtered = filter_var( $test, FILTER_VALIDATE_URL ) )
  {
    $host = parse_url( $filtered, PHP_URL_HOST );
    // The host must be blah.com itself or end in ".blah.com", so "fooblah.com" is rejected
    if ( $host && preg_match( '/(^|\.)blah\.com$/i', $host ) )
    {
      echo $filtered, " <strong>matched!</strong><br>";
    } else {
      echo $filtered, " <strong>did not match.</strong><br>";
    }
  } else {
    echo $test, " <strong>did not match.</strong><br>";
  }
}
Peter Bailey
The docs for the `parse_url` function state that it isn't meant to validate URLs: invalid URLs may still get parsed. So you need some additional checks.
D. Evans
Oh, I agree - it probably needs more rigorous testing. Still, my regex solution works just as well.
Peter Bailey
I adopted the logic of your post into my 2nd algo. Seems to work well!
Peter Bailey
What about `http://blah.com.example`?
Gumbo
@Gumbo - thanks! Updated regex
Peter Bailey
Brilliant Peter :) - exactly what I was looking for.
foxed
+1  A: 

Perhaps:

^https?://[^/]*blah\.com(|/.*)$

Edit:

Protect against http://editblah.com

^https?://(([^/]*\.)|)blah\.com(|/.*)$
Douglas Leeder
Close! But this would give a false positive for a domain like fooblah.com
Peter Bailey
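A quick way to see what the edit in this answer changes is to run both versions against the hosts mentioned here. This is only a sketch; the variable names and the `#` delimiters are choices made for this example.

```php
<?php
// Original pattern: anything may precede "blah.com" in the host.
$first = '#^https?://[^/]*blah\.com(|/.*)$#';
// Edited pattern: "blah.com" must start the host or follow a dot.
$edited = '#^https?://(([^/]*\.)|)blah\.com(|/.*)$#';

$urls = [
    'http://blah.com',
    'http://www.blah.com/so/this/is/good',
    'http://editblah.com',
    'http://fooblah.com',
    'http://blah.com.example',
];

foreach ($urls as $url) {
    printf("%-40s first=%d edited=%d\n", $url, preg_match($first, $url), preg_match($edited, $url));
}
```

Only the first pattern matches `http://editblah.com` and `http://fooblah.com`; the edited one requires a dot (or nothing) immediately before `blah.com`.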
+6  A: 

Do you have to use a regex? PHP has a lot of built-in functions for doing this kind of thing.

    filter_var($url, FILTER_VALIDATE_URL)

will tell you if a URL is valid, and

    $domain = parse_url($url, PHP_URL_HOST);

will tell you the domain it refers to.

It might be clearer and more maintainable than some mad regex.
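Putting the two calls together might look like the sketch below. The function name `is_url_on_domain` and its parameters are made up for this example, and the scheme check reflects the question's "http or https" note rather than anything the functions require. `filter_var` runs first because `parse_url` on its own is not a validator.

```php
<?php
// Hypothetical helper: true when $url is a well-formed http(s) URL whose host
// is $domain itself or a subdomain of it.
function is_url_on_domain(string $url, string $domain): bool
{
    // Reject anything that is not a syntactically valid URL.
    if (filter_var($url, FILTER_VALIDATE_URL) === false) {
        return false;
    }

    // Only http and https are of interest here (an assumption taken from the question).
    $scheme = strtolower((string) parse_url($url, PHP_URL_SCHEME));
    if (!in_array($scheme, ['http', 'https'], true)) {
        return false;
    }

    $host = parse_url($url, PHP_URL_HOST);
    if (!is_string($host) || $host === '') {
        return false;
    }

    // Host names are case-insensitive, so compare in lower case.
    $host = strtolower($host);
    $domain = strtolower($domain);
    $suffix = '.' . $domain;

    // Exact match, or the host ends in ".<domain>" (so "fooblah.com" does not pass).
    return $host === $domain || substr($host, -strlen($suffix)) === $suffix;
}

var_dump(is_url_on_domain('http://www.blah.com/so/this/is/good/mice.html#anchortag', 'blah.com'));
var_dump(is_url_on_domain('http://fooblah.com/', 'blah.com'));
```

Comparing against `"." . $domain` rather than just the domain is what keeps look-alike hosts such as `fooblah.com` or `blah.com.example` out.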

D. Evans