Hello,

I am looking for an existing filter that can sanitize user input to avoid XSS. I can probably use htmlspecialchars for that. But at the same time I want to parse all links (it should match a.com, www.a.com and http://www.a.com, and a long URL such as http://www.aaaaaaaaaaaaaaaaaaaaaaaaaa.com should be displayed as aaa..a.com), e-mail addresses and smileys.

I am wondering what the best way to go about it is. I am currently using a PHP function with some regexes, but the regex often fails (because links are recognized incorrectly, etc.). I want something very similar to the parser used in Google Chat (even a.com works).

Thank you for your time.

A: 

An easy way to (try to!) get links from a string is something like this:

$text = 'I am looking at some sort of existing filter which can sanitize 
the user input to avoid XSS. Probably I can use htmlspecialchars for that. 
But at the same time I want to be able to parse all links (should match 
a.com, www.a.com and http://www.a.com and if it is 
http://www.aaaaaaaaaaaaaaaaaaaaaaaaaa.com then it should display it 
as aaa..a.com), e-mails and smileys.

I am wondering what is the best way to go about it. I am currently using 
a php function with some regex, but many times the regex simply fails 
(because of link recognition is incorrect etc.). I want something very 
similar to the parser used during Google Chat (even a.com works).';

// Grab every run of non-whitespace characters that ends in a known TLD.
preg_match_all('/\S+\.(?:com|org|net)/i', $text, $urls);

print_r($urls);

Which produces:

Array
(
    [0] => Array
        (
            [0] => a.com
            [1] => www.a.com
            [2] => http://www.a.com
            [3] => http://www.aaaaaaaaaaaaaaaaaaaaaaaaaa.com
            [4] => aaa..a.com
            [5] => a.com
        )

)

After matching the (possible!) URLs, you can clean up the list, i.e. remove invalid ones like 'aaa..a.com' and shorten very long URLs like 'http://www.aaaaaaaaaaaaaaaaaaaaaaaaaa.com'.

I don't recommend cramming everything in one large, unmaintainable regex. Do it in steps.
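
The clean-up steps could look roughly like the sketch below. The function names (isPlausibleUrl, shortenForDisplay, toLink), the 20-character limit and the "first 3 + last 5 characters" rule are assumptions made for illustration; they are not part of the answer above, and the host-name check is deliberately simple.

```php
<?php
// Hypothetical helpers for illustration only.

// Step 1: reject candidates whose host part is not a plausible domain name.
function isPlausibleUrl(string $candidate): bool
{
    // Strip an optional scheme, then cut off any path, query or fragment.
    $host = preg_replace('~^https?://~i', '', $candidate);
    $host = preg_split('~[/?#]~', $host)[0];

    // One or more labels (letters, digits, inner hyphens) separated by
    // single dots, followed by an alphabetic TLD.
    return preg_match('/^(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z]{2,}$/i', $host) === 1;
}

// Step 2: shorten the visible text only; the link target stays untouched.
function shortenForDisplay(string $url, int $max = 20): string
{
    // Drop an optional scheme and an optional leading "www." from the label.
    $label = preg_replace('~^(?:https?://)?(?:www\.)?~i', '', $url);
    if (strlen($label) <= $max) {
        return $label;
    }
    // Keep the first 3 and the last 5 characters.
    return substr($label, 0, 3) . '..' . substr($label, -5);
}

// Step 3: build the anchor tag, escaping everything that goes into HTML.
function toLink(string $url): string
{
    $href = preg_match('~^https?://~i', $url) ? $url : 'http://' . $url;
    return '<a href="' . htmlspecialchars($href, ENT_QUOTES, 'UTF-8') . '">'
        . htmlspecialchars(shortenForDisplay($url), ENT_QUOTES, 'UTF-8') . '</a>';
}

$candidates = ['a.com', 'www.a.com', 'http://www.a.com',
    'http://www.aaaaaaaaaaaaaaaaaaaaaaaaaa.com', 'aaa..a.com'];

foreach (array_filter($candidates, 'isPlausibleUrl') as $url) {
    echo toLink($url), "\n";
}
```

With these assumptions 'aaa..a.com' is dropped because of the empty label between the two dots, and the long URL is shown as aaa..a.com while its href keeps the full address.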

Good luck!

PS. Needless to say, you can/should expand the list of TLDs yourself. (?:com|org|net) was just an example.
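
One way to keep the TLD list maintainable is to build the alternation from an array instead of editing the regex by hand. This is only a sketch: the $tlds entries are an arbitrary sample, and the trailing \b is an addition that is not in the pattern above (it stops "net" from matching inside "network").

```php
<?php
// Illustrative list only; extend it as needed.
$tlds = ['com', 'org', 'net', 'info', 'co.uk'];

// Longest first, so a longer entry is tried before a shorter one that
// happens to be its prefix.
usort($tlds, function ($a, $b) {
    return strlen($b) - strlen($a);
});

// preg_quote() escapes the dot in entries such as "co.uk".
$alternation = implode('|', array_map(function ($tld) {
    return preg_quote($tld, '/');
}, $tlds));

$pattern = '/\S+\.(?:' . $alternation . ')\b/i';

preg_match_all($pattern, 'see a.com, b.co.uk and c.network', $urls);
print_r($urls[0]);
```

For this sample string the matches are a.com and b.co.uk; c.network is skipped because of the word boundary.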

Bart Kiers
A: 

For smileys you might want to look at http://www.php.net/manual/en/book.bbcode.php (a PECL extension; requires PHP 5.2.0 or later)
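
If installing an extension is not an option, a plain-PHP alternative is sketched below. It does not use the bbcode extension at all: the function name renderSmileys, the smiley map and the image paths are made-up placeholders. The input is split on the smiley codes first and only the text in between is escaped, so that the ";" at the end of an entity such as &quot; is never mistaken for part of a ";)" smiley.

```php
<?php
// Hypothetical smiley map; the image paths are placeholders.
$smileys = [
    ':-)' => '<img src="/img/smile.png" alt=":-)">',
    ':)'  => '<img src="/img/smile.png" alt=":)">',
    ';)'  => '<img src="/img/wink.png" alt=";)">',
];

function renderSmileys(string $input, array $smileys): string
{
    // Build one alternation of all codes, longest first.
    $codes = array_keys($smileys);
    usort($codes, function ($a, $b) {
        return strlen($b) - strlen($a);
    });
    $pattern = '/(' . implode('|', array_map(function ($code) {
        return preg_quote($code, '/');
    }, $codes)) . ')/';

    // With one capture group, odd-indexed parts are the smiley codes
    // and even-indexed parts are the plain text between them.
    $parts = preg_split($pattern, $input, -1, PREG_SPLIT_DELIM_CAPTURE);

    $out = '';
    foreach ($parts as $i => $part) {
        $out .= ($i % 2 === 1)
            ? $smileys[$part]                                // our own markup
            : htmlspecialchars($part, ENT_QUOTES, 'UTF-8');  // user text, escaped
    }
    return $out;
}

echo renderSmileys('<b>hi</b> :-) ;)', $smileys), "\n";
```

The user-supplied <b> tags come out escaped, and only the img tags from the map end up as real markup.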

eyescream