tags:
views: 221
answers: 5

I need to find the best way (in terms of performance) to determine whether a given string is a URL.
A regexp won't help: www.eeeeeeeeeeeeeee.bbbbbbbbbbbbbbbb.com is a syntactically valid URL, but it doesn't exist on any network known to man.
I am thinking of using cURL and checking whether I get status 200 back, or just calling file_get_contents and analyzing the result.
Is there a better way?

+6  A: 

Don't fetch the whole contents - that could be enormous. Issue a HEAD request instead.

You could do some validation first, of course - remove things which are invalid as URLs, rather than just URLs which aren't currently served by anything. After that, issuing a HEAD request is about as good as it gets. Having said that, it becomes a grey area... what about a URL which returns "authorization required"? It could be a password-protected directory, but if you knew the password you'd then get back a 404 because the file itself doesn't exist...
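
For example, a minimal PHP sketch of such a HEAD check (the function name and timeouts are illustrative, not part of the original answer):

function url_responds($url)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request: headers only, no body
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // don't echo the headers
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_exec($ch);
    $code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    curl_close($ch);

    // Treat any 2xx/3xx status as "something is serving this URL".
    return $code >= 200 && $code < 400;
}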

Jon Skeet
A: 
$host != gethostbyname($host)

for checking the host.
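
A rough sketch of how that check might look (the parse_url step is an assumption, not part of the original answer):

$host = parse_url($url, PHP_URL_HOST);
// gethostbyname() returns the hostname unchanged when it cannot resolve,
// so a different return value means the name resolved to an IP address.
if ($host && $host != gethostbyname($host)) {
    // host exists in DNS
}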

Zed
+4  A: 

This article outlines how to perform a DNS request from PHP. That might be the fastest option, although it won't tell you whether the server is online or the file is found; it only tells you that the hostname resolves to an IP. It's up to you whether that fits your definition of "valid".
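
A sketch along those lines using PHP's built-in checkdnsrr() (which record types to check is a judgment call, not something the answer specifies):

$host = parse_url($url, PHP_URL_HOST);
// Does the host have an A or MX record in DNS?
$is_registered = $host && (checkdnsrr($host, 'A') || checkdnsrr($host, 'MX'));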

Chris Thompson
+1  A: 

You don't mean a URL, you mean a Domain Name

ראובן
A: 

I would strongly suggest using cURL, but fetching just the headers without any content.

Here is the function I use to verify that a given URL is valid and reachable.

function __checkUrl($url)
{
    // First check with a pattern whether the URL is syntactically valid
    $pattern = '/^(([\w]+:)?\/\/)?(([\d\w]|%[a-fA-F\d]{2})+(:([\d\w]|%[a-fA-F\d]{2})+)?@)?([\d\w][-\d\w]{0,253}[\d\w]\.)+[\w]{2,4}(:[\d]+)?(\/([-+_~.\d\w]|%[a-fA-F\d]{2})*)*(\?(&?([-+_~.\d\w]|%[a-fA-F\d]{2})=?)*)?(#([-+_~.\d\w]|%[a-fA-F\d]{2})*)?$/';
    if (preg_match($pattern, $url))
    {
        $ch = curl_init();

        // set URL and other appropriate options
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
        curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request: headers only
        curl_setopt($ch, CURLOPT_AUTOREFERER, true);
        curl_setopt($ch, CURLOPT_FORBID_REUSE, true);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 4);
        curl_setopt($ch, CURLOPT_TIMEOUT, 4);

        // issue the request
        $output = curl_exec($ch);
        // get the response code and the final URL after redirects
        $response_code = curl_getinfo($ch, CURLINFO_HTTP_CODE);
        $newurl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);
        curl_close($ch);

        // Request failed or resource not found?
        if ($output === false || $response_code == 404) {
            return false;
        } else {
            return $newurl;
        }
    }
    else
    {
        return false;
    }
}

With this function, I first check that the URL is syntactically valid with a regex, and then request it with cURL. Setting CURLOPT_FOLLOWLOCATION to true takes care of 301 and similar redirects, with the number of redirects limited to 3. Finally, we return the effective URL after all redirections.

Hope this helps.

Thanashyam
You know that URL validation regex is quite bogus, right? (As hinted in the OP's question.)
bobince
Is this the "head" request described in the first answer, or you fetch here the entire page?
Itay Moav
@Itay Moav: curl_setopt($ch, CURLOPT_NOBODY, true); -- causes curl to send a HEAD request.
GZipp