An easy way to (try to!) get links from a string is something like this:
$text = 'I am looking at some sort of existing filter which can sanitize
the user input to avoid XSS. Probably I can use htmlspecialchars for that.
But at the same time I want to be able to parse all links (should match
a.com, www.a.com and http://www.a.com and if it is
http://www.aaaaaaaaaaaaaaaaaaaaaaaaaa.com then it should display it
as aaa..a.com), e-mails and smileys.
I am wondering what is the best way to go about it. I am currently using
a php function with some regex, but many times the regex simply fails
(because the link recognition is incorrect, etc.). I want something very
similar to the parser used in Google Chat (even a.com works).';
preg_match_all('/\S+\.(?:com|org|net)/i', $text, $urls);
print_r($urls);
Which produces:
Array
(
[0] => Array
(
[0] => a.com
[1] => www.a.com
[2] => http://www.a.com
[3] => http://www.aaaaaaaaaaaaaaaaaaaaaaaaaa.com
[4] => aaa..a.com
[5] => a.com
)
)
And after matching the (possible!) URLs, you could sanitize the list in a second pass: i.e. remove invalid ones like 'aaa..a.com' and shorten very long URLs like 'http://www.aaaaaaaaaaaaaaaaaaaaaaaaaa.com'.
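For example, a rough second pass over the matches could look like this (just a sketch: the host regex, the 25-character cutoff and the 'aaa..a.com' display style are arbitrary choices, not the only way to do it):
// A possible second pass over $urls[0] from the snippet above
$clean = array();
foreach ($urls[0] as $url) {
    // Strip the scheme so only the host part is left
    $host = preg_replace('~^https?://~i', '', $url);
    // Drop candidates with an empty label, e.g. 'aaa..a.com'
    if (!preg_match('/^([a-z0-9-]+\.)+[a-z]{2,}$/i', $host)) {
        continue;
    }
    // Shorten very long hosts for display ('aaa..a.com' style)
    if (strlen($host) > 25) {
        $bare = preg_replace('/^www\./i', '', $host);
        $host = substr($bare, 0, 3) . '..' . substr($bare, -5);
    }
    $clean[] = $host;
}
print_r($clean);
You might also want to de-duplicate the result with array_unique() afterwards.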
I don't recommend cramming everything into one large, unmaintainable regex; do it in steps.
Good luck!
PS: Needless to say, you can (and should) expand the list of TLDs yourself; (?:com|org|net) was just an example.
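If you want to keep that list readable, one option is to build the alternation from a plain array (the TLDs listed here are just placeholders):
// Build the TLD alternation from an array instead of hard-coding it
$tlds = array('com', 'org', 'net', 'edu', 'info', 'co.uk'); // extend as needed
$pattern = '/\S+\.(?:' . implode('|', array_map('preg_quote', $tlds)) . ')/i';
preg_match_all($pattern, $text, $urls);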