@Lauri Lehtinen's answer is good for learning the idea behind the technique, but you shouldn't use it in practice because it would make your site extremely vulnerable to XSS attacks. Also, link spammers would appreciate the lack of rel="nofollow"
on the generated links.
Instead, use something like:
<?php
// \author Daniel Trebbien
// \date 2010-06-22
// \par License
// Public Domain
$allowed_uri_schemes = array('http', 'https', 'ftp', 'ftps', 'irc', 'mailto');
/**
* Encodes a string in RFC 3986
*
* \see http://tools.ietf.org/html/rfc3986
*/
function encode_uri($str)
{
$str = urlencode('' . $str);
$search = array('%3A', '%2F', '%3F', '%23', '%5B', '%5D', '%40', '%21', '%24', '%26', '%27', '%28', '%29', '%2A', '%2B', '%2C', '%3B', '%3D', '%2E', '%7E');
$replace = array(':', '/', '?', '#', '[', ']', '@', '!', '$', '&', '\'', '(', ')', '*', '+', ',', ';', '=', '.', '~'); // gen-delims / sub-delims / unreserved
return str_ireplace($search, $replace, $str);
}
function url_preg_replace_callback($matches)
{
global $allowed_uri_schemes;
if (empty($matches[1]))
return $matches[0];
$href = trim($matches[1]);
if (($i = strpos($href, ':')) !== FALSE) {
if (strrpos($href, '/', $i) === FALSE) {
if (!in_array(strtolower(substr($href, 0, $i)), $allowed_uri_schemes))
return $matches[0];
}
}
// unescape `\]`, `\\\]`, `\\\\\]`, etc.
for ($j = strpos($href, '\\]'); $j !== FALSE; $j = strpos($href, '\\]', $j)) {
for ($i = $j - 2; $i >= 0 && $href[$i] == '\\' && $href[$i + 1] == '\\'; $i -= 2)
/* empty */;
$i += 2;
$h = '';
if ($i > 0)
$h = substr($href, 0, $i);
for ($numBackslashes = floor(($j - $i)/2); $numBackslashes > 0; --$numBackslashes)
$h .= '\\';
$h .= ']';
if (($j + 2) < strlen($href))
$h .= substr($href, $j + 2);
$href = $h;
$j = $i + floor(($j - $i)/2) + 1;
}
if (!empty($matches[2]))
$href .= str_replace('\\\\', '\\', $matches[2]);
if (empty($matches[3]))
$linkText = $href;
else {
$linkText = trim($matches[3]);
if (empty($linkText))
$linkText = $href;
}
$href = htmlspecialchars(encode_uri(htmlspecialchars_decode($href)));
return "<a href=\"$href\" rel=\"nofollow\">$linkText</a>";
}
function render($input)
{
$input = htmlspecialchars(strip_tags('' . $input));
$input = preg_replace_callback('~\[url=((?:[^\]]|(?<!\\\\)(?:\\\\\\\\)*\\\\\])*)((?<!\\\\)(?:\\\\\\\\)*)\]' . '((?:[^[]|\[(?!/)|\[/(?!u)|\[/u(?!r)|\[/ur(?!l)|\[/url(?!\]))*)' . '\[/url\]~i', 'url_preg_replace_callback', $input);
return $input;
}
which I believe is safe against XSS. This version has the added benefit that it is possible to write out links to URLs that contain ']'
.
Evaluate this code with the following "test suite":
echo render('[url=http://www.bing.com/][[/[/u[/ur[/urlBing[/url]') . "\n";
echo render('[url=][/url]') . "\n";
echo render('[url=http://www.bing.com/][[/url]') . "\n";
echo render('[url=http://www.bing.com/][/[/url]') . "\n";
echo render('[url=http://www.bing.com/][/u[/url]') . "\n";
echo render('[url=http://www.bing.com/][/ur[/url]') . "\n";
echo render('[url=http://www.bing.com/][/url[/url]') . "\n";
echo render('[url=http://www.bing.com/][/url][/url]') . "\n";
echo render('[url= javascript: window.alert("hi")]click me[/url]') . "\n";
echo render('[url=#" onclick="window.alert(\'hi\')"]click me[/url]') . "\n";
echo render('[url=http://www.bing.com/] [/url]') . "\n";
echo render('[url=/?#[\\]@!$&\'()*+,;=.~] [/url]') . "\n"; // link text should be `/?#[]@!$&'()*+,;=.~`
echo render('[url=http://localhost/\\\\]d]abc[/url]') . "\n"; // href should be `http://localhost/%5C`, link text should be `d]abc`
echo render('[url=\\]][/url]') . "\n"; // link text should be `]`
echo render('[url=\\\\\\]][/url]') . "\n"; // link text should be `\]`
echo render('[url=\\\\\\\\\\]][/url]') . "\n"; // link text should be `\\]`
echo render('[url=a\\\\\\\\\\]bcde\\]fgh\\\\\\]ijklm][/url]') . "\n"; // link text should be `a\\]bcde]fgh\]ijklm`
Or, just look at the Codepad results.
As you can see, it works.