$bits = preg_split('#((?:https?|ftp)://[^\s\'"<>()]+)#S', $token->data, -1, PREG_SPLIT_DELIM_CAPTURE);

Say I'm trying to match URLs that need to be linkified. The above is too permissive.

I want to match only bare URLs like http://google.com, but not URLs that are already inside markup, such as <a href="http://google.com">http://google.com</a> or <iframe src="http://google.com"></iframe>.

+2  A: 

It appears that you're trying to parse HTML using regular expressions. You might want to rethink that.

Nick Bastin
how is matching a url in a string parsing html?
grapefrukt
You're matching the URL within an HTML context. Load the HTML into a DOMDocument and then test each text node against your pattern.
Justin Johnson
I don't see how that linked answer solves my question, though.
wamp
@wamp: If you're specifically trying to avoid a greedy algorithm that eats HTML tags, that must mean you're in a position (at least sometimes) where your link will be embedded in HTML. And that way lies madness.
Nick Bastin
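
The DOMDocument suggestion from the comments above could look roughly like the sketch below. It reuses the pattern from the question; the function name linkify(), the wrapper id "linkify-root", and the choice to skip text that already sits inside an <a> element are assumptions made for illustration, not something anyone in the thread posted.

```php
<?php
// Hypothetical helper (not from the thread): wraps bare URLs found in text
// nodes in <a> elements and leaves attributes and existing links untouched.
function linkify(string $html): string
{
    $pattern = '#((?:https?|ftp)://[^\s\'"<>()]+)#S'; // pattern from the question

    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // tolerate sloppy real-world markup
    // The wrapper <div> gives the fragment a single known root to read back.
    $doc->loadHTML('<div id="linkify-root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);
    $root  = $xpath->query('//div[@id="linkify-root"]')->item(0);

    // Text nodes only; URLs in href/src attributes are never visited.
    $textNodes = $xpath->query('.//text()[not(ancestor::a)]', $root);

    foreach (iterator_to_array($textNodes) as $node) {
        $bits = preg_split($pattern, $node->nodeValue, -1, PREG_SPLIT_DELIM_CAPTURE);
        if (count($bits) < 2) {
            continue; // no URL in this text node
        }
        $fragment = $doc->createDocumentFragment();
        foreach ($bits as $i => $bit) {
            if ($i % 2 === 1) {
                // Odd indexes hold the captured delimiter, i.e. the URL.
                $a = $doc->createElement('a');
                $a->setAttribute('href', $bit);
                $a->appendChild($doc->createTextNode($bit));
                $fragment->appendChild($a);
            } elseif ($bit !== '') {
                $fragment->appendChild($doc->createTextNode($bit));
            }
        }
        $node->parentNode->replaceChild($fragment, $node);
    }

    // Serialize the wrapper's children, dropping the wrapper itself.
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }
    return $out;
}
```

Calling linkify() on a string that contains one bare URL and one existing link should only wrap the bare one, because the URL inside the href attribute and the text inside the <a> element are never tested against the pattern. This sketch assumes ASCII input; other encodings would need extra handling when loading the HTML.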
A: 

Try this:

function validUrl($url){
    $matches = array();
    $regex  = '#(^';                    # match[1]: the whole URL
    $regex .= '((https?|ftps?)://)?';   # Scheme: match[2] with "://", match[3] without
    $regex .= '(([0-9a-z-]+\.)+';       # Domain: match[4] complete, match[5] last label
    $regex .= '([a-z]{2,3}|aero|coop|jobs|mobi|museum|name|travel))'; # TLD: match[6]
    $regex .= '(:[0-9]{1,5})?';         # Port: match[7]
    $regex .= '(/[^ ]*)?';              # Path and query: match[8]
    $regex .= '$)#i';
    if (!preg_match($regex, $url, $matches)) {
        return FALSE;
    }
    $domain = $matches[4];
    # gethostbyname() returns its argument unchanged when the lookup fails,
    # so compare against the input rather than testing for a falsy value.
    if (gethostbyname($domain) === $domain) {
        return FALSE;
    }
    return $matches;
}
jatt
I've updated the question to make it clear.
wamp
@jatt: And how does a more complex regex help in this case? Read the question again.
Tomalak
And in any case, trying to enumerate “valid” TLDs is an exercise in futility.
bobince
A: 

RE

http:\/\/[a-zA-Z0-9\.\-]*

Result

Array
(
    [0] => http://google.com
)
articlestack
A: 

More effective RE

[hf]t{1,2}p:\/\/[a-zA-Z0-9\.\-]*

Result

Array
(
    [0] => Array
        (
            [0] => ftp://article-stack.com
            [1] => http://google.com
        )
)
articlestack
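
For reference, a nested result like the one above is what preg_match_all() returns. The sketch below is one way it could have been produced; the sample strings are assumptions, since the answer does not show its input.

```php
<?php
// Pattern from the answer, written with "#" delimiters so "/" needs no escaping.
$pattern = '#[hf]t{1,2}p://[a-zA-Z0-9.\-]*#';

// Assumed sample input; the original answer does not show what it matched against.
$text = 'See ftp://article-stack.com and http://google.com for details.';
preg_match_all($pattern, $text, $matches);
print_r($matches);

// The same pattern has no notion of HTML context, so a URL inside an
// attribute value is matched as well.
$html = '<iframe src="http://google.com"></iframe>';
$count = preg_match_all($pattern, $html, $inAttribute);
```

Here $matches[0] holds the two URLs from the sample text, and $count is 1 for the iframe string, so a pattern like this does not by itself keep URLs in attributes from being matched. It also does not match https URLs, because the "p" must be followed directly by "://".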