In a PHP application I am writing, I would like users to be able to enter a mix of HTML and plain text containing angle brackets. When I display this text, I want the HTML tags to be rendered but the non-HTML brackets to be shown literally, e.g. a user should be able to enter:

<b> 5 > 3 = true</b>

When displayed, the user should see (in bold):

5 > 3 = true

What is the best way to parse this, i.e. to find all the non-HTML brackets and convert them to &gt; and &lt;?

+2  A: 

I'd recommend having the users enter BBCode-style markup, which you then replace with the HTML tags:

[b]This is bold[/b]
[i]this is italic with a > 'greater than' sign there[/i]

This gives you more control over how you parse users' input into HTML, though I admit it looks like an unnecessary burden.
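
As a rough illustration of the replacement step, here is a minimal sketch. The function name `bbcode_to_html` and the choice of supporting only `[b]` and `[i]` are assumptions made for the example. It escapes the whole input first and then maps the known BBCode markers to HTML tags; it does not check that tags are balanced or nested correctly.

```php
<?php
// Hypothetical helper: converts a tiny BBCode subset ([b], [i]) to HTML.
function bbcode_to_html(string $input): string
{
    // Escape everything first, so any angle bracket the user typed is literal.
    $html = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // Then map the known BBCode markers to their HTML equivalents.
    $map = [
        '[b]'  => '<b>',
        '[/b]' => '</b>',
        '[i]'  => '<i>',
        '[/i]' => '</i>',
    ];

    return strtr($html, $map);
}

echo bbcode_to_html('[i]italic with a > sign[/i]'), "\n";
```

Because square brackets are not touched by `htmlspecialchars`, the markers survive the escaping step and can be swapped afterwards.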

philistyne
This doesn't really get to the root of the problem, though, which is that if he wants to allow HTML/BBCode, he is going to have to deal with XSS. Writing a good BBCode parser is not trivial; writing a good HTML parser even more so.
Edward Z. Yang
A: 

The best way would be to do the opposite: instead of finding the non-HTML brackets and escaping them, first escape everything and then look for &lt;b&gt; and &lt;/b&gt; and unescape only these special cases. This way you do not risk a user injecting malicious HTML in your page (if you try to escape only what is needed, you risk missing something important).
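
A minimal sketch of this escape-first approach, assuming only attribute-free tags such as `<b>` and `<i>` need to be allowed. The function name `escape_then_allow` and its `$allowedTags` parameter are made up for the example.

```php
<?php
// Hypothetical helper: escape all input, then restore a whitelist of bare tags.
function escape_then_allow(string $input, array $allowedTags = ['b', 'i']): string
{
    // Step 1: escape everything, including the tags we intend to allow.
    $escaped = htmlspecialchars($input, ENT_QUOTES, 'UTF-8');

    // Step 2: unescape only the exact opening and closing forms of allowed tags.
    foreach ($allowedTags as $tag) {
        $escaped = str_replace(
            ["&lt;{$tag}&gt;", "&lt;/{$tag}&gt;"],
            ["<{$tag}>", "</{$tag}>"],
            $escaped
        );
    }

    return $escaped;
}

echo escape_then_allow('<b> 5 > 3 = true</b>'), "\n";
```

Anything that is not an exact match, such as `<script>` or a `<b>` tag carrying attributes, stays escaped. Like the prose above, this sketch does not check that the allowed tags are properly closed.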

CesarB
+1  A: 

You should look at making use of HTML Purifier too.

dylanfm
A: 

There are PEAR and PECL libraries that implement BBCode for you.

Grayside
+1  A: 

If you're allowing user-input HTML, you've got to solve a far bigger problem than a few unescaped angle brackets; HTML is really tough to validate and filter properly, and if you don't do it right you open yourself up to XSS attacks. I've written a library that does this; someone else already posted a link to it here so I won't reiterate.

To answer your question, however, the most foolproof way of converting stray angle brackets to their escaped forms is parsing the HTML with DOM/libxml and then reserializing it. Anything that uses regexes or the like is doomed to fail on edge cases. You could also write your own parser, but that takes a fair bit of work.
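
A minimal sketch of the parse-and-reserialize idea using PHP's `DOMDocument`. The function name `reserialize_fragment` and the wrapping `<div>` are assumptions made for the example. It assumes ASCII or otherwise encoding-safe input, and it only normalizes the markup; it does not remove dangerous tags or attributes, so it is not an XSS filter on its own.

```php
<?php
// Hypothetical helper: parse an HTML fragment and serialize it back out.
function reserialize_fragment(string $input): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings about malformed markup instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    // Wrap the fragment in a div so it has a single, known container element.
    $doc->loadHTML('<div>' . $input . '</div>');

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialize only the children of the wrapper, not the html/body scaffolding.
    $container = $doc->getElementsByTagName('div')->item(0);
    $out = '';
    foreach ($container->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return $out;
}

echo reserialize_fragment('<b> 5 > 3 = true</b>'), "\n";
```

How a stray `<` is treated depends on the libxml version in use, so it is worth testing that case against your own installation.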

Edward Z. Yang