ansaurus

Question

Best way to handle mixed HTML and in user input?

Answer 1

+2 A:

I'd recommend having users enter BBCode-style markup, which you then replace with the corresponding HTML tags:

[b]This is bold[/b]
[i]this is italic with a > 'greater than' sign there[/i]

This gives you more control over how you turn users' input into HTML, though I admit it looks like an unnecessary burden.
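A minimal sketch of that replacement step, written in Python purely for illustration. The function name `bbcode_to_html` and the `BBCODE_TAGS` table are hypothetical, and this is a naive regex pass rather than a real BBCode parser. It escapes the raw input first so that a stray `>` such as the one in the example above cannot end up as markup:

```python
import html
import re

# Hypothetical mapping from BBCode tag name to HTML tag name.
BBCODE_TAGS = {"b": "b", "i": "i"}


def bbcode_to_html(text):
    """Escape the raw input, then turn paired [b]/[i] markers into HTML tags."""
    # Escape &, < and > first so user-typed brackets are never treated as HTML.
    result = html.escape(text, quote=False)
    for bb_name, html_name in BBCODE_TAGS.items():
        # Non-greedy match of [tag]...[/tag]; unpaired markers are left as text.
        pattern = re.compile(rf"\[{bb_name}\](.*?)\[/{bb_name}\]", re.DOTALL)
        result = pattern.sub(rf"<{html_name}>\1</{html_name}>", result)
    return result


print(bbcode_to_html("[b]This is bold[/b]"))
print(bbcode_to_html("[i]this is italic with a > 'greater than' sign there[/i]"))
```

Badly nested markers (for example `[b][i]x[/b][/i]`) would come out badly nested here too, so treat this as a starting point only.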

philistyne 2008-11-19 09:30:53

This doesn't really get to the root of the problem, though, which is that if he wants to allow HTML/BBCode, he is going to have to deal with XSS. Writing a good BBCode parser is not trivial; writing a good HTML parser even more so.

Edward Z. Yang 2008-11-19 19:41:49

Answer 2

A:

The best way would be to do the opposite: instead of finding the non-HTML brackets and escaping them, first escape everything and then look for <b> and </b> and unescape only these special cases. This way you do not risk a user injecting malicious HTML in your page (if you try to escape only what is needed, you risk missing something important).
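Roughly, the idea could look like the following Python sketch. The names `escape_then_allow` and `ALLOWED_TAGS` are made up for illustration, and only bare tags without attributes are restored:

```python
import html
import re

# Hypothetical whitelist of tag names that may be restored after escaping.
ALLOWED_TAGS = ("b", "i")


def escape_then_allow(text):
    """Escape everything, then unescape only bare whitelisted tags."""
    escaped = html.escape(text, quote=False)
    names = "|".join(ALLOWED_TAGS)
    # Matches exactly &lt;b&gt; or &lt;/b&gt; (and likewise for i).
    # Anything with attributes or another tag name stays escaped.
    return re.sub(rf"&lt;(/?)({names})&gt;", r"<\1\2>", escaped)


print(escape_then_allow("<b>bold</b> and 1 < 2 <script>alert(1)</script>"))
```

Because the ampersand is escaped as well, a user who types a literal `&lt;b&gt;` gets that text displayed back rather than a bold tag. The sketch does not check that the restored tags are balanced.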

CesarB 2008-11-19 10:21:28

Answer 3

+1 A:

You should look at making use of HTML Purifier too.

dylanfm 2008-11-19 10:26:08

Answer 4

A:

There are PEAR and PECL libraries that implement BBCode for you.

Grayside 2008-11-19 19:25:49

Answer 5

+1 A:

If you're allowing user-supplied HTML, you've got to solve a far bigger problem than a few unescaped angle brackets; HTML is really tough to validate and filter properly, and if you don't do it right you open yourself up to XSS attacks. I've written a library that does this; someone else already posted a link to it here so I won't reiterate.

To answer your question, however, the most foolproof way of converting stray angle brackets to their escaped forms is parsing the HTML with DOM/libxml, and then reserializing it. Anything that uses regexes or the like is doomed to fail on some edge case. You could also write your own parser, but that takes a bit of work too.
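To illustrate the parse-then-reserialize idea, here is a small Python sketch. It swaps in the standard library's `html.parser` as a stand-in for DOM/libxml, so it is an event-based approximation and not the exact approach described above. The names `WhitelistSerializer`, `reserialize` and `ALLOWED_TAGS` are hypothetical:

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist; every other tag is dropped, and so are all attributes.
ALLOWED_TAGS = {"b", "i"}


class WhitelistSerializer(HTMLParser):
    """Re-emit parsed input, keeping whitelisted tags and escaping all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"<{tag}>")  # attributes are discarded

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text, including stray angle brackets, is written back escaped.
        self.parts.append(escape(data, quote=False))


def reserialize(text):
    parser = WhitelistSerializer()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


print(reserialize("1 < 2 <b>ok</b>"))
```

Unlike a real DOM round trip, this sketch does not close or rebalance unclosed tags, and it is not a substitute for a vetted sanitizing library.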

Edward Z. Yang 2008-11-19 19:46:23
