Hello

I have this text

$string = "this is my friend's website http://example.com I think it is coll";

How can I extract the link into another variable?

I know it should be done with a regular expression, probably with "preg_match", but I don't know how.

Thanks

+1  A: 

I don't know PHP, so I can't give you exact syntax off the top of my head, but I would suggest using regular expressions. Here is a link on using regular expressions in PHP: http://www.regular-expressions.info/php.html. Also, here is a link for email regular expressions: http://www.regular-expressions.info/email.html

Good luck.

Max Schmeling
+2  A: 

This has already been covered here.

n3rd
A: 
preg_match_all('/[a-z]+:\/\/\S+/', $string, $matches);

This is an easy way that works for a lot of cases, but not all. All the matches are put in $matches. Note that this does not cover links in anchor elements (<a href=""...), but that wasn't in your example either.
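
To show how the match ends up in a separate variable, here is a minimal sketch that applies the same pattern to the string from the question. The variable names $count and $link are only illustrative; they are not part of the original answer.

```php
<?php
// Sample text taken verbatim from the question.
$string = "this is my friend's website http://example.com I think it is coll";

// preg_match_all() returns the number of full matches, or false on error.
$count = preg_match_all('/[a-z]+:\/\/\S+/', $string, $matches);

// $matches[0] lists every full match; take the first one if there is any.
$link = ($count > 0) ? $matches[0][0] : null;

var_dump($link);
```

If only the first link is ever needed, preg_match() with the same pattern should also do; the match is then in $matches[0] directly.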

antennen
-1: you've just created an XSS vulnerability, since it would also extract javascript: URLs.
Michael Borgwardt
It's not stated what he'd use it for, hence I don't account for that. He just wanted to get URLs into variables.
antennen
@Michael: Finding javascript URLs is not yet a vulnerability; using them without any check is. Sometimes the presence and number of such URLs is useful information. I'd have chosen a different delimiter. :)
toscho
A: 

URLs have quite a complex definition, so you must first decide what you want to capture. A simple example capturing anything starting with http:// or https:// could be:

preg_match_all('!https?://[\S]+!', $string, $matches);
$all_urls = $matches[0];

Note that this is very basic and could capture invalid URLs. I would recommend reading up on POSIX and PHP regular expressions for more complex things.
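
One possible way to weed out some of the invalid captures is sketched below. It assumes PHP 7 or later; the helper name extract_urls() and the set of trailing punctuation characters are my own choices, not part of the answer. It strips punctuation that most likely belongs to the sentence and then lets PHP's built-in filter_var() reject candidates that are not well-formed URLs.

```php
<?php
// Hypothetical helper (not from the answer): collect http/https candidates,
// then keep only those that PHP's URL filter accepts.
function extract_urls(string $text): array
{
    // preg_match_all() returns 0 when nothing matches and false on error.
    if (!preg_match_all('!https?://\S+!', $text, $matches)) {
        return [];
    }

    $urls = [];
    foreach ($matches[0] as $candidate) {
        // Trailing punctuation usually belongs to the sentence, not the URL.
        $candidate = rtrim($candidate, '.,;:!?)');

        // FILTER_VALIDATE_URL checks syntax only; it does not check
        // that the host exists or that the target is safe.
        if (filter_var($candidate, FILTER_VALIDATE_URL) !== false) {
            $urls[] = $candidate;
        }
    }
    return $urls;
}

$urls = extract_urls("See http://example.com, or https://example.org/page.");
print_r($urls);
```

This is still a heuristic: a URL that legitimately ends in one of the stripped characters would be cut short.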

soulmerge
+2  A: 

If the text you extract the URLs from is user-submitted and you're going to display the result as links anywhere, you have to be very, VERY careful to avoid XSS vulnerabilities, most prominently "javascript:" protocol URLs, but also malformed URLs that might trick your regexp and/or the displaying browser into executing them as JavaScript URLs. At the very least, you should accept only URLs that start with "http", "https" or "ftp".

There's also a blog entry by Jeff where he describes some other problems with extracting URLs.
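
The scheme restriction suggested above could be sketched roughly as follows, assuming PHP 7 or later. The helper name has_allowed_scheme() and the sample candidates are hypothetical; the allow-list itself ("http", "https", "ftp") is the one from this answer.

```php
<?php
// Hypothetical helper: true only if the URL's scheme is on the allow-list.
function has_allowed_scheme(string $url, array $allowed = ['http', 'https', 'ftp']): bool
{
    $scheme = parse_url($url, PHP_URL_SCHEME);

    // parse_url() yields null when no scheme is present and false
    // for seriously malformed input; reject both.
    if (!is_string($scheme)) {
        return false;
    }

    // Schemes are case-insensitive, so normalise before comparing.
    return in_array(strtolower($scheme), $allowed, true);
}

$candidates = [
    'http://example.com',
    'javascript:alert(1)',
    'ftp://example.com/file',
    'JaVaScRiPt:alert(1)',
];

// array_values() re-indexes the result after filtering.
$safe = array_values(array_filter($candidates, 'has_allowed_scheme'));
print_r($safe);
```

A scheme check alone is not a complete XSS defence. Anything printed into HTML should still be escaped, for example with htmlspecialchars().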

Michael Borgwardt
A: 

I'm pretty sure there'd be some significant differences between extraction and validation. Let's not link to the validation discussion.

Waldron