ansaurus

Question

Trouble with regular expression for comments code

Answer 1

+2 A:

I experimented a bit with the following:

function text_format($string)
{
    return preg_replace('#\[url=([^\]]+)\]([^\[]*)\[/url\]#', '<a href="$1">$2</a>', $string);
}

However, one immediate fault with this is that if linktext is empty, there will be nothing between <a> and </a>. One way around it would be to do another pass with something like this:

preg_replace('#<a href="([^"]+)"></a>#', '<a href="$1">$1</a>', $string);

Another option would be to use preg_replace_callback and put this logic inside your callback function.
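As a rough sketch of that callback approach (the name `text_format_cb` is made up for illustration, and a closure needs PHP 5.3 or later), the empty-link-text fallback could be handled in a single pass. Like the one-liner above, this does no escaping or URL validation.

```php
<?php
// Sketch only: text_format_cb is a hypothetical name, not from the answer.
function text_format_cb($string)
{
    return preg_replace_callback(
        '#\[url=([^\]]+)\]([^\[]*)\[/url\]#',
        function ($m) {
            // Fall back to the URL when the link text is empty or whitespace.
            $text = trim($m[2]) === '' ? $m[1] : $m[2];
            return '<a href="' . $m[1] . '">' . $text . '</a>';
        },
        $string
    );
}

echo text_format_cb('[url=http://example.com][/url]'), "\n";
// <a href="http://example.com">http://example.com</a>
```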

Finally, this is obviously a common "problem" that has been solved many times by others. If using a more mature open-source solution is an option, I'd recommend looking for one.

Lauri Lehtinen 2010-06-21 22:20:17

Thank you very much for the quick answer. Regular expressions are really confusing for a noob, and probably still confusing after years of programming experience. My brain melts just from looking at the mass of brackets :) You are a god; the code works like a charm. Thank you very much for the help.

Rakoon 2010-06-21 22:25:56

@Rakoon - No need to get religious on us. Experience plays a small role here I think. :)

ChaosPandion 2010-06-21 22:29:53

True. But after having wrung my brains out for quite a while, an immediate and intelligent answer to a tricky problem (for a noob) made me just a bit religiously inclined :)

Rakoon 2010-06-21 22:37:15

Answer 2

+4 A:

It looks like you're using something similar to BBCode. Why not use a BBCode parser, such as this one?

http://nbbc.sourceforge.net/

It also handles smileys, replacing them with images. If you use their test page, you will still see the text, though, because they don't host the images and they set the alt text to the smiley.

jdmichal 2010-06-21 22:21:07

Hmm. A good idea, but I got the parsing to work and don't need more than a few simple formatting options. The smileys are custom and built into a theme system I made, so that one can have theme-independent smileys. Anyway, thanks for the answer.

Rakoon 2010-06-21 22:28:36

Answer 3

+2 A:

@Lauri Lehtinen's answer is good for learning the idea behind the technique, but you shouldn't use it in practice because it would make your site extremely vulnerable to XSS attacks. Also, link spammers would appreciate the lack of rel="nofollow" on the generated links.

Instead, use something like:

<?php
// \author Daniel Trebbien
// \date 2010-06-22
// \par License
//  Public Domain

$allowed_uri_schemes = array('http', 'https', 'ftp', 'ftps', 'irc', 'mailto');

/**
 * Encodes a string as a URI per RFC 3986
 *
 * \see http://tools.ietf.org/html/rfc3986
 */
function encode_uri($str)
{
    $str = urlencode('' . $str);
    $search = array('%3A', '%2F', '%3F', '%23', '%5B', '%5D', '%40', '%21', '%24', '%26', '%27', '%28', '%29', '%2A', '%2B', '%2C', '%3B', '%3D', '%2E', '%7E');
    $replace = array(':', '/', '?', '#', '[', ']', '@', '!', '$', '&', '\'', '(', ')', '*', '+', ',', ';', '=', '.', '~'); // gen-delims / sub-delims / unreserved
    return str_ireplace($search, $replace, $str);
}

function url_preg_replace_callback($matches)
{
    global $allowed_uri_schemes;

    if (empty($matches[1]))
        return $matches[0];
    $href = trim($matches[1]);
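    if (($i = strpos($href, ':')) !== FALSE) {
        // A colon with no '/' before it marks a URI scheme, so check it
        // against the whitelist (this also rejects `javascript://...`).
        if (strpos(substr($href, 0, $i), '/') === FALSE) {
            if (!in_array(strtolower(substr($href, 0, $i)), $allowed_uri_schemes))
                return $matches[0];
        }
    }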

    // unescape `\]`, `\\\]`, `\\\\\]`, etc.
    for ($j = strpos($href, '\\]'); $j !== FALSE; $j = strpos($href, '\\]', $j)) {
        for ($i = $j - 2; $i >= 0 && $href[$i] == '\\' && $href[$i + 1] == '\\'; $i -= 2)
            /* empty */;
        $i += 2;

        $h = '';
        if ($i > 0)
            $h = substr($href, 0, $i);
        for ($numBackslashes = floor(($j - $i)/2); $numBackslashes > 0; --$numBackslashes)
            $h .= '\\';
        $h .= ']';
        if (($j + 2) < strlen($href))
            $h .= substr($href, $j + 2);
        $href = $h;
        $j = $i + floor(($j - $i)/2) + 1;
    }

    if (!empty($matches[2]))
        $href .= str_replace('\\\\', '\\', $matches[2]);

    if (empty($matches[3]))
        $linkText = $href;
    else {
        $linkText = trim($matches[3]);
        if (empty($linkText))
            $linkText = $href;
    }
    $href = htmlspecialchars(encode_uri(htmlspecialchars_decode($href)));
    return "<a href=\"$href\" rel=\"nofollow\">$linkText</a>";
}

function render($input)
{
    $input = htmlspecialchars(strip_tags('' . $input));
    $input = preg_replace_callback('~\[url=((?:[^\]]|(?<!\\\\)(?:\\\\\\\\)*\\\\\])*)((?<!\\\\)(?:\\\\\\\\)*)\]' . '((?:[^[]|\[(?!/)|\[/(?!u)|\[/u(?!r)|\[/ur(?!l)|\[/url(?!\]))*)' . '\[/url\]~i', 'url_preg_replace_callback', $input);
    return $input;
}

which I believe is safe against XSS. This version has the added benefit that it is possible to write out links to URLs that contain ']'.

Evaluate this code with the following "test suite":

echo render('[url=http://www.bing.com/][[/[/u[/ur[/urlBing[/url]') . "\n";
echo render('[url=][/url]') . "\n";
echo render('[url=http://www.bing.com/][[/url]') . "\n";
echo render('[url=http://www.bing.com/][/[/url]') . "\n";
echo render('[url=http://www.bing.com/][/u[/url]') . "\n";
echo render('[url=http://www.bing.com/][/ur[/url]') . "\n";
echo render('[url=http://www.bing.com/][/url[/url]') . "\n";
echo render('[url=http://www.bing.com/][/url][/url]') . "\n";
echo render('[url=    javascript: window.alert("hi")]click me[/url]') . "\n";
echo render('[url=#" onclick="window.alert(\'hi\')"]click me[/url]') . "\n";
echo render('[url=http://www.bing.com/]       [/url]') . "\n";
echo render('[url=/?#[\\]@!$&\'()*+,;=.~]       [/url]') . "\n"; // link text should be `/?#[]@!$&amp;'()*+,;=.~`
echo render('[url=http://localhost/\\\\]d]abc[/url]') . "\n"; // href should be `http://localhost/%5C`, link text should be `d]abc`
echo render('[url=\\]][/url]') . "\n"; // link text should be `]`
echo render('[url=\\\\\\]][/url]') . "\n"; // link text should be `\]`
echo render('[url=\\\\\\\\\\]][/url]') . "\n"; // link text should be `\\]`
echo render('[url=a\\\\\\\\\\]bcde\\]fgh\\\\\\]ijklm][/url]') . "\n"; // link text should be `a\\]bcde]fgh\]ijklm`

Or, just look at the Codepad results.

As you can see, it works.

Daniel Trebbien 2010-06-21 23:41:52

Hello, and thanks for the reply. This may not be an issue with the page I am writing now, since it's a web app where the users can alter the content and add and delete galleries or blog posts. However, the users are people designated by the super admin, so people can't sign up and get access to commenting. If I decide to expand the model to allow signing up, this becomes a much larger problem. The O'Reilly book mentions stripping tags and something I believe is called real_escape_string that untaints text. Can people still write harmful text after this, or is it the link form that allows it?

Rakoon 2010-06-22 11:24:06

@Rakoon: `mysql_real_escape_string` and variants are for preventing a different class of attacks called *SQL injection*. Just as bad, but different. This code is for escaping text to prevent XSS. It is always good to escape everything because you should assume that all user input is bad, even if it originates from "trusted users". What if a cracker manages to obtain a trusted user's login credentials, for example? Use the `render` function to make all "code"-enabled comments and blog posts safe.

Daniel Trebbien 2010-06-22 14:58:23
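To make the distinction in the comment above concrete, here is a minimal sketch of output escaping for HTML, the XSS side of the problem. The helper name `escape_for_html` is hypothetical, and the sketch assumes the page is served as UTF-8. SQL escaping is a separate step applied when building queries, not when rendering.

```php
<?php
// Hypothetical helper name; ENT_QUOTES also escapes single quotes.
function escape_for_html($text)
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

$comment = '<script>alert("hi")</script>';
echo escape_for_html($comment), "\n";
// &lt;script&gt;alert(&quot;hi&quot;)&lt;/script&gt;
```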

I must admit that I am still a bit too dumb to understand the entirety of your code. Is it possible to replace "javascript:" with "", for instance? Or would a hacker just use workarounds? I have the impression that most of the wizards who answer my questions on this page could hack my site within seconds, regardless. I should read up on the different code you are using to become more enlightened. Basically, I should use EscapeShellArg and real escape, then do an XSS untaint?

Rakoon 2010-06-22 16:37:43

@Rakoon: Just `javascript:` alone, no, because there's `vbscript:` and any number of other script-related URI schemes that are yet to be invented, and `data:`, which might be exploitable. In this case, it's better to whitelist by using `$allowed_uri_schemes`. Also, "XSS untaint" basically means `htmlspecialchars`, but this does not allow *any* formatting. Thus, there are markup schemes like BBcode and Markdown, or even the one you invented (sample input: `[url=http://www.bing.com]Bing[/url] is a search engine.`). By the way, `EscapeShellArg` is for a different problem: command injection.

Daniel Trebbien 2010-06-22 17:49:07
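A small sketch of the whitelist idea from the comment above, separate from the answer's own code: the function name `has_allowed_scheme` is invented for illustration, and it covers only the scheme check, so the href still needs encoding before output. Anything before the first colon is treated as a scheme candidate unless a `/`, `?` or `#` comes first.

```php
<?php
// Hypothetical helper: true when $href has no scheme or an allowed one.
function has_allowed_scheme($href, array $allowed)
{
    $href = trim($href);
    // Everything before the first ':' is a scheme candidate, unless a
    // '/', '?' or '#' comes first (then it is a relative reference).
    if (!preg_match('~^([^:/?#]*):~', $href, $m)) {
        return true;
    }
    return in_array(strtolower($m[1]), $allowed, true);
}

$allowed = array('http', 'https', 'ftp', 'ftps', 'irc', 'mailto');
var_dump(has_allowed_scheme('http://www.bing.com/', $allowed)); // bool(true)
var_dump(has_allowed_scheme('javascript://x', $allowed));       // bool(false)
var_dump(has_allowed_scheme('vbscript:msgbox', $allowed));      // bool(false)
```

Blacklisting `javascript:` alone would miss the last case, which is why the whitelist is the safer default.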

I tried running some XSS cheat sheet scripts in my text input and have been troubleshooting quite a bit. One problem that occurs, though, is that escapeshellarg removes æ, ø and å from my text when the text is displayed. By the way, thanks for all the help. You have saved me a great deal of time :)

Rakoon 2010-06-23 20:07:14

You're welcome. I'm glad that I could help.

Daniel Trebbien 2010-06-23 20:28:15

Would you know why æ ø and å are removed by escape shell arg?

Rakoon 2010-06-23 21:21:23

By "when the text is displayed", do you mean in the web browser? If so, then you may need to set the HTML document's charset. Do: `header('Content-Type: text/html; charset=UTF-8');` For me, `escapeshellarg` does not remove æ ø and å: http://codepad.org/4KWeTJAZ

Daniel Trebbien 2010-06-23 22:54:38

Tried changing to UTF-8. Still doesn't work. This is when I use escape shell arg on user submitted text before it is displayed in blog posts. UTF-8 only makes ø æ and å display as diamonds with "?" inside.

Rakoon 2010-06-23 23:06:03

Hmmm. That shouldn't happen. UTF-8 (Unicode) should make diamonds and question marks *go away*.

Daniel Trebbien 2010-06-23 23:17:40

I am currently using ISO-8859-1, which displays Western European characters correctly, whilst UTF-8 does not. I heard that UTF-8 should show everything, so something else must be causing it.

Rakoon 2010-06-24 00:10:58
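The thread does not settle the cause, but one common explanation for diamonds with "?" is that the stored bytes are ISO-8859-1 while the page is declared as UTF-8. Under that assumption, converting the text before output would look roughly like this (a sketch using PHP's `iconv`; the sample bytes are illustrative):

```php
<?php
// "\xE6\xF8\xE5" is "æøå" encoded as ISO-8859-1 (one byte per letter).
$latin1 = "\xE6\xF8\xE5";

// Convert the bytes to UTF-8 before sending them in a UTF-8 page.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

echo bin2hex($latin1), "\n"; // e6f8e5
echo bin2hex($utf8), "\n";   // c3a6c3b8c3a5
```

The single bytes `e6`, `f8` and `e5` are not valid UTF-8 on their own, which is why a browser told to expect UTF-8 shows replacement characters for them.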
