ansaurus

Question

Answer 1

+4 A:

test a few out, you either get an inefficient mess, or you get poor code with XSS holes

Hell yeah. I've not met a bbcode implementation yet that wasn't XSS-vulnerable.

'<a href="\1">\1</a>'

No good: fails to HTML-escape ‘<’, ‘&’ and ‘"’ characters.

Is it better practice, in this example, to only match using a uri/url validation expression? Or, is it better to use (.*?) and a callback, then ascertain whether or not the input is a valid link?

I would take the callback. You need the callback anyway to do the HTML-escaping; it's not possible to be secure with only simple string replacement. Drop the sanitisation in whilst you're doing it.

What about functions like urlencode() within a callback

Nearly; actually you need htmlspecialchars(). urlencode() is about encoding query parameters, which isn't what you need here.

Would it be safer to write a full-stack parser?

Yes.

bbcode is not really amenable to regex parsing, because it's a recursive tag-based language (like XML, which regex also cannot parse). Many bbcode holes are caused by nesting and misnesting problems. For example:

[url]http://www.example.com/[i][/url]foo[/i]

Could come out as something like

<a href="http://www.example.com/&amp;lt;i&gt;"&gt;foo&lt;/i&gt;

there are many other traps that generate broken code (up to an including XSS holes) on various bbcode implementations.

I'm looking for principles and best practices

If you need a bbcode-like language that you can regex, you need to:

reduce the number of possible tags that can be put inside other tags. Arbitrary nesting isn't really possible to support
use special characters for ‘<’ and ‘>’ HTML tag delimiters, to distinguish them from real angle brackets that should appear as such in the text. I use ASCII control codes (having previously filtered any control characters out at the user input stage).
split the string being processed on these control characters on content between these two control characters, so that you never end up letting a bbcode span reach inside a tag or over a tag boundary.
because you can't have bbcode spans reaching through tag boundaries work from the outside in, doing large block elements first and working inwards to links and finally bold and italic.
for sanity, process a block at a time. eg. If you're starting a new <p> on a double-newline, no bbcode tags can span between the two separate blocks.

It's still damned hard to get right. A proper parser is much more likely to be watertight.

bobince 2009-04-09 17:27:19

Hmm, I do agree with you in what you said, but I haven't had much skill in making a proper parser. Know of any decent tutorials for XML-esque parsing? I've found it difficult to find a good one that isn't overly complicated, yet still on the skill level necessary.

The Wicked Flea 2009-04-09 17:54:58

If you can't find a third-party parser library that satisfies your needs, you could do it by hand: first preg_split-with-PREG_SPLIT_DELIM_CAPTURE over the string with something like ‘\[[^\]]+\]’ to pick out the tags, then walk through the list keeping a stack of opened tags.

bobince 2009-04-10 23:45:21

(Even-numbered indexes in the list would be text, odd-numbered ones tags. Normally text would get HTML-escaped, and maybe have smileys autoreplaced if you're doing that, but some tags might change that.)

bobince 2009-04-10 23:47:09

ansaurus

tags:

views:

answers:

Regex and the "war" on XSS

related questions