Hey

I am building a site where logged-in users can write comments. The comment string is first run through a function that replaces emoticons using str_replace. After that I want it to replace

[url=www.whatever.com]linktext[/url]

with:

<a href='www.whatever.com'>linktext</a>

The reason is that I want to strip all HTML that isn't generated by my own comment code, in case some users decide to get creative.

I thought it would be best to use preg_replace, but the code I ended up with (partly from reading about regular expressions in my trusty O'Reilly SQL and PHP book and partly from the web) is pretty bonkers and, most importantly, doesn't work.

Any help would be appreciated, thanks.

It is probably possible to replace the whole tag in one pass rather than in two segments as I have done. I just decided that getting two smaller parts to work first would be easier, and that I could merge them afterwards.

code:

function text_format($string)
{
    $pattern="/([url=)+[a-zA-Z0-9]+(])+/";
    $string=preg_replace($pattern, "/(<a href=\')+[a-zA-Z0-9]+(\'>)+/", $string);
    $pattern="/([\/url])+/";
    $string=preg_replace($pattern, "/(<\/a>)+/", $string);    
    return $string;
}
+2  A: 

I experimented a bit with the following:

function text_format($string)
{
    return preg_replace('#\[url=([^\]]+)\]([^\[]*)\[/url\]#', '<a href="$1">$2</a>', $string);
}

However, one immediate fault with this is that if linktext is empty, there will be nothing between <a> and </a>. One way around it would be to do another pass with something like this:

preg_replace('#<a href="([^"]+)"></a>#', '<a href="$1">$1</a>', $string);

Another option would be to use preg_replace_callback and put this logic inside your callback function.
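
A minimal sketch of that callback approach follows. The function name text_format_with_fallback is made up for illustration, and the pattern is the same one used above. Note that this sketch does no escaping or URL validation, so it is only appropriate for input that has already been sanitized.

```php
<?php
// Hypothetical helper name; not part of the original answer.
function text_format_with_fallback($string)
{
    return preg_replace_callback(
        '#\[url=([^\]]+)\]([^\[]*)\[/url\]#',
        function ($m) {
            $href = $m[1];
            // Fall back to the URL itself when the link text is empty or only whitespace.
            $text = trim($m[2]) === '' ? $href : $m[2];
            return '<a href="' . $href . '">' . $text . '</a>';
        },
        $string
    );
}

echo text_format_with_fallback('[url=www.whatever.com][/url]'), "\n";
// prints <a href="www.whatever.com">www.whatever.com</a>
```

This handles the empty-linktext case in a single pass, so the second preg_replace is not needed.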

Finally, this is obviously a common "problem" that has been solved many times by others. If using a more mature open-source solution is an option, I'd recommend looking for one.

Lauri Lehtinen
Thank you very much for the quick answer. Regular expressions are really confusing for a noob, and probably still confusing after years of programming experience. My brain melts just looking at the mass of brackets :) You are god, the code works like a charm, and thank you very much for the help.
Rakoon
@Rakoon - No need to get religious on us. Experience plays a small role here I think. :)
ChaosPandion
True. But after having wrung my brains out for quite a while, an immediate, and intelligent answer to a tricky problem (for a noob) made me just a bit religiously inclined :)
Rakoon
+4  A: 

It looks like you're using something similar to BBCode. Why not use a BBCode parser, such as this one?

http://nbbc.sourceforge.net/

It also handles smileys, replacing them with images. If you use their test page, you will still see the text though, because they don't host the images and they set the alt text to the smiley.

jdmichal
Hmm. A good idea, but I got the parsing to work, and I don't need more than a few simple formatting options. The smileys are custom and built into a theme system I made, so that one can have theme-independent smileys. Anyway, thanks for the answer.
Rakoon
+2  A: 

@Lauri Lehtinen's answer is good for learning the idea behind the technique, but you shouldn't use it in practice because it would make your site extremely vulnerable to XSS attacks. Also, link spammers would appreciate the lack of rel="nofollow" on the generated links.

Instead, use something like:

<?php
// \author Daniel Trebbien
// \date 2010-06-22
// \par License
//  Public Domain

$allowed_uri_schemes = array('http', 'https', 'ftp', 'ftps', 'irc', 'mailto');

/**
 * Percent-encodes a string per RFC 3986, leaving reserved and unreserved characters as-is
 *
 * \see http://tools.ietf.org/html/rfc3986
 */
function encode_uri($str)
{
    $str = urlencode('' . $str);
    $search = array('%3A', '%2F', '%3F', '%23', '%5B', '%5D', '%40', '%21', '%24', '%26', '%27', '%28', '%29', '%2A', '%2B', '%2C', '%3B', '%3D', '%2E', '%7E');
    $replace = array(':', '/', '?', '#', '[', ']', '@', '!', '$', '&', '\'', '(', ')', '*', '+', ',', ';', '=', '.', '~'); // gen-delims / sub-delims / unreserved
    return str_ireplace($search, $replace, $str);
}

function url_preg_replace_callback($matches)
{
    global $allowed_uri_schemes;

    if (empty($matches[1]))
        return $matches[0];
    $href = trim($matches[1]);
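    if (($i = strpos($href, ':')) !== FALSE) {
        // A colon that appears before any '/' terminates a URI scheme; check that scheme against the whitelist.
        if (strpos(substr($href, 0, $i), '/') === FALSE) {
            if (!in_array(strtolower(substr($href, 0, $i)), $allowed_uri_schemes))
                return $matches[0];
        }
    }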

    // unescape `\]`, `\\\]`, `\\\\\]`, etc.
    for ($j = strpos($href, '\\]'); $j !== FALSE; $j = strpos($href, '\\]', $j)) {
        for ($i = $j - 2; $i >= 0 && $href[$i] == '\\' && $href[$i + 1] == '\\'; $i -= 2)
            /* empty */;
        $i += 2;

        $h = '';
        if ($i > 0)
            $h = substr($href, 0, $i);
        for ($numBackslashes = floor(($j - $i)/2); $numBackslashes > 0; --$numBackslashes)
            $h .= '\\';
        $h .= ']';
        if (($j + 2) < strlen($href))
            $h .= substr($href, $j + 2);
        $href = $h;
        $j = $i + floor(($j - $i)/2) + 1;
    }

    if (!empty($matches[2]))
        $href .= str_replace('\\\\', '\\', $matches[2]);

    if (empty($matches[3]))
        $linkText = $href;
    else {
        $linkText = trim($matches[3]);
        if (empty($linkText))
            $linkText = $href;
    }
    $href = htmlspecialchars(encode_uri(htmlspecialchars_decode($href)));
    return "<a href=\"$href\" rel=\"nofollow\">$linkText</a>";
}

function render($input)
{
    $input = htmlspecialchars(strip_tags('' . $input));
    $input = preg_replace_callback('~\[url=((?:[^\]]|(?<!\\\\)(?:\\\\\\\\)*\\\\\])*)((?<!\\\\)(?:\\\\\\\\)*)\]' . '((?:[^[]|\[(?!/)|\[/(?!u)|\[/u(?!r)|\[/ur(?!l)|\[/url(?!\]))*)' . '\[/url\]~i', 'url_preg_replace_callback', $input);
    return $input;
}

which I believe is safe against XSS. This version has the added benefit that it is possible to write out links to URLs that contain ']'.

Evaluate this code with the following "test suite":

echo render('[url=http://www.bing.com/][[/[/u[/ur[/urlBing[/url]') . "\n";
echo render('[url=][/url]') . "\n";
echo render('[url=http://www.bing.com/][[/url]') . "\n";
echo render('[url=http://www.bing.com/][/[/url]') . "\n";
echo render('[url=http://www.bing.com/][/u[/url]') . "\n";
echo render('[url=http://www.bing.com/][/ur[/url]') . "\n";
echo render('[url=http://www.bing.com/][/url[/url]') . "\n";
echo render('[url=http://www.bing.com/][/url][/url]') . "\n";
echo render('[url=    javascript: window.alert("hi")]click me[/url]') . "\n";
echo render('[url=#" onclick="window.alert(\'hi\')"]click me[/url]') . "\n";
echo render('[url=http://www.bing.com/]       [/url]') . "\n";
echo render('[url=/?#[\\]@!$&\'()*+,;=.~]       [/url]') . "\n"; // link text should be `/?#[]@!$&amp;'()*+,;=.~`
echo render('[url=http://localhost/\\\\]d]abc[/url]') . "\n"; // href should be `http://localhost/%5C`, link text should be `d]abc`
echo render('[url=\\]][/url]') . "\n"; // link text should be `]`
echo render('[url=\\\\\\]][/url]') . "\n"; // link text should be `\]`
echo render('[url=\\\\\\\\\\]][/url]') . "\n"; // link text should be `\\]`
echo render('[url=a\\\\\\\\\\]bcde\\]fgh\\\\\\]ijklm][/url]') . "\n"; // link text should be `a\\]bcde]fgh\]ijklm`

Or, just look at the Codepad results.

As you can see, it works.

Daniel Trebbien
Hello, and thanks for the reply. This may not be an issue with the page I am writing now, since it's a web app where the users can alter the content and add and delete galleries or blog posts. However, the users are people designated by the super admin, so people can't sign up and get access to commenting. If I decide to expand the model to allow signing up, this becomes a much larger problem. The O'Reilly book mentions stripping tags and something I believe is called real_escape_string that untaints text. Can people still write harmful text after this, or is it the link form that allows it?
Rakoon
@Rakoon: `mysql_real_escape_string` and variants are for preventing a different class of attacks called *SQL injection*. Just as bad, but different. This code is for escaping text to prevent XSS. It is always good to escape everything because you should assume that all user input is bad, even if it originates from "trusted users". What if a cracker manages to obtain a trusted user's login credentials, for example? Use the `render` function to make all "code"-enabled comments and blog posts safe.
Daniel Trebbien
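
To illustrate the distinction, here is a minimal sketch of output escaping for the HTML context. The helper name escape_for_html is invented for this example, and the ENT_QUOTES flag and UTF-8 charset are assumptions of the sketch rather than anything stated in the thread.

```php
<?php
// Hypothetical helper name; ENT_QUOTES and 'UTF-8' are choices made for this sketch.
function escape_for_html($text)
{
    // Converts <, >, &, " and ' to HTML entities so the browser shows them as text.
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

echo escape_for_html('<script>alert("hi")</script>'), "\n";
// prints &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;
```

Escaping for SQL or for the shell uses different functions because those contexts treat different characters as special.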
I must admit that I am still a bit too dumb to understand the entirety of your code. Is it possible to replace "javascript:" with "", for instance? Or would the hacker just use workarounds? I have the impression that most of the wizards who answer my questions on this page could hack my site within seconds, regardless. I should read up on the functions you are using to become more enlightened. Basically, I should use EscapeShellArg and real escape, then do an XSS untaint?
Rakoon
@Rakoon: Just `javascript:` alone, no, because there's `vbscript:` and any number of other script-related URI schemes that are yet to be invented, and `data:`, which might be exploitable. In this case, it's better to whitelist by using `$allowed_uri_schemes`. Also, "XSS untaint" basically means `htmlspecialchars`, but this does not allow *any* formatting. Thus, there are markup schemes like BBcode and Markdown, or even the one you invented (sample input: `[url=http://www.bing.com]Bing[/url] is a search engine.`). By the way, `EscapeShellArg` is for a different problem: command injection.
Daniel Trebbien
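
As a rough sketch of the whitelist idea on its own, separate from the BBCode parsing: the function name has_allowed_scheme is hypothetical, and the rule "a colon before any slash marks a scheme" is a simplification that errs on the side of rejecting unusual input.

```php
<?php
// Hypothetical helper illustrating a URI scheme whitelist; the name is an assumption.
function has_allowed_scheme($href, array $allowed)
{
    $href = trim($href);
    $colon = strpos($href, ':');
    $slash = strpos($href, '/');
    // No colon, or a slash before the first colon: treat as a relative reference with no scheme.
    if ($colon === false || ($slash !== false && $slash < $colon)) {
        return true;
    }
    // Everything before the first colon is the scheme; compare case-insensitively.
    $scheme = strtolower(substr($href, 0, $colon));
    return in_array($scheme, $allowed, true);
}

$allowed = array('http', 'https', 'ftp', 'ftps', 'irc', 'mailto');
var_dump(has_allowed_scheme('http://www.bing.com/', $allowed));  // bool(true)
var_dump(has_allowed_scheme('javascript:alert(1)//', $allowed)); // bool(false)
var_dump(has_allowed_scheme('data:text/html,hi', $allowed));     // bool(false)
```

Because anything not on the list is rejected, schemes that do not exist yet are blocked without further changes.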
Tried running some XSS cheat sheet scripts in my text input and have been troubleshooting quite a bit. One problem, though, is that escapeshellarg removes æ, ø and å from my text when the text is displayed. By the way, thanks for all the help. You have saved me a great deal of time :)
Rakoon
You're welcome. I'm glad that I could help.
Daniel Trebbien
Would you know why æ ø and å are removed by escape shell arg?
Rakoon
By "when the text is displayed", do you mean in the web browser? If so, then you may need to set the HTML document's charset. Do: `header('Content-Type: text/html; charset=UTF-8');` For me, `escapeshellarg` does not remove æ ø and å: http://codepad.org/4KWeTJAZ
Daniel Trebbien
Tried changing to UTF-8. Still doesn't work. This is when I use escape shell arg on user submitted text before it is displayed in blog posts. UTF-8 only makes ø æ and å display as diamonds with "?" inside.
Rakoon
Hmmm. That shouldn't happen. UTF-8 (Unicode) should make diamonds and question marks *go away*.
Daniel Trebbien
I am currently using iso-8859-1 which displays western european signs correctly, whilst UTF-8 does not. I heard that UTF should show everything, so something else must be causing it.
Rakoon
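
One common cause of this symptom, which may or may not be what is happening here, is that the stored bytes are ISO-8859-1 while the page declares UTF-8. The sketch below assumes that situation and assumes the iconv extension is available. The helper name latin1_to_utf8 is made up for illustration.

```php
<?php
// Hypothetical helper; assumes the input really is ISO-8859-1 and that iconv is installed.
function latin1_to_utf8($text)
{
    return iconv('ISO-8859-1', 'UTF-8', $text);
}

// "\xE6\xF8\xE5" is "æøå" encoded as ISO-8859-1 (one byte per character).
$latin1 = "\xE6\xF8\xE5";
$utf8 = latin1_to_utf8($latin1);

// In UTF-8 each of these characters takes two bytes.
echo bin2hex($utf8), "\n"; // prints c3a6c3b8c3a5
```

If the text is converted like this (or stored as UTF-8 in the first place), a page served with charset=UTF-8 should display the characters normally.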