In my web application I am developing a comment feature that lets users post comments. The problem is that I want to allow simple HTML tags in the comment box: tags such as <b>, <strong>, <i>, <em>, <u>, etc., that are normally allowed in a comment box. I also want each line break the user types with the Enter key to be converted automatically into a <br /> tag before the comment is stored in the database, so that when I display it on the web page it looks the way the user entered it.

Can you please tell me how to check that the user entered only the allowed set of HTML tags, how to convert line breaks into <br /> tags, and how to store the result in the database?

Or, if anyone has a better idea or suggestion for implementing this kind of functionality, please share it. I am using ASP.NET 2.0 (C#).

I noticed that StackOverflow.com does the same thing in profile editing: when you edit your profile, the line "basic HTML allowed" appears below the "About Me" field. I want almost the same functionality.

A: 

I'm not sure, but I think you have to escape the HTML characters when inserting into the database and echo them unescaped when retrieving, so that the browser renders them as HTML.

perfectDay
I don't have a problem storing the HTML characters in the database; the problem is how to check whether the user entered only allowed HTML tags.
Prashant
+1  A: 

I don't have a C#-specific answer for you, but you can go about it a few different ways. One is to let the user input whatever they want and then run a filter over it to strip out the "bad" HTML. There are numerous open source filters that do this for PHP, Python, etc. In general, it's a pretty difficult problem, and it's best to let some well-developed third-party code do this rather than write it yourself.
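
To make the filter idea concrete, here is a minimal sketch in Python (not ASP.NET) built on the standard library's html.parser module. The whitelist ALLOWED_TAGS and the names WhitelistFilter and sanitize are made up for this illustration. It only shows the shape of such a filter and is not a substitute for the well-tested third-party code recommended above.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: an illustrative whitelist; adjust it to your own policy.
ALLOWED_TAGS = {"b", "strong", "i", "em", "u"}


class WhitelistFilter(HTMLParser):
    """Rebuilds the input, keeping only whitelisted tags (without attributes)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes are always dropped, so onclick=, style=, etc. never survive.
        if tag in ALLOWED_TAGS:
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is re-escaped so stray <, > and & cannot form new markup.
        self.parts.append(escape(data, quote=False))


def sanitize(comment: str) -> str:
    """Return the comment with every non-whitelisted tag removed."""
    parser = WhitelistFilter()
    parser.feed(comment)
    parser.close()
    return "".join(parser.parts)
```

For example, sanitize('<b onclick="x()">hi</b>') returns '<b>hi</b>'. Note that this sketch does not balance tags, so an unclosed <b> is passed through as-is; a production filter would also need to handle that.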

Another way to handle it is to let the user enter comments in a simpler markup language such as BBCode, Textile, or Markdown (Stack Overflow uses Markdown), perhaps in conjunction with a nice JavaScript editor. You then run the user's text through a processor for that markup language to get the HTML. You can usually find implementations of these processors for whatever language you are using, and they usually strip out the "bad" HTML.

Brian Neal
A: 

I don't know ASP.NET, but in PHP there's an easy function, strip_tags, that lets you add exceptions (in your case, b, em, etc.). If there's nothing like that in C#, you can write a regular expression that strips out all tags except the allowed ones, but chances are that such an expression already exists, so it should be easy to find.

Replacing \n (or something similar) with <br /> shouldn't be a problem either; a simple search and replace will do.
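
A rough sketch of both steps, written in Python rather than C#. The names ALLOWED, TAG_RE, strip_tags and newlines_to_br are invented for this example, and the function only imitates the idea behind PHP's strip_tags, not its exact behavior. A regular expression like this is easy to fool (for instance by a > inside an attribute value), so treat it as an illustration only.

```python
import re

# Hypothetical whitelist for illustration.
ALLOWED = ("b", "strong", "i", "em", "u")

# Matches a tag-like run "<...>": group 1 is "/" for closing tags,
# group 2 is the tag name; anything after the name (attributes) is ignored.
TAG_RE = re.compile(r"<\s*(/?)\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def strip_tags(text, allowed=ALLOWED):
    """Remove every tag except the allowed ones, which lose their attributes."""
    def replace(match):
        slash, name = match.group(1), match.group(2).lower()
        return f"<{slash}{name}>" if name in allowed else ""
    return TAG_RE.sub(replace, text)


def newlines_to_br(text):
    """Turn each line ending (Windows, old Mac or Unix style) into <br />."""
    return re.sub(r"\r\n|\r|\n", "<br />", text)
```

With these definitions, strip_tags('<b>bold</b> <a href="x">link</a>') gives '<b>bold</b> link', and newlines_to_br('a\r\nb\nc') gives 'a<br />b<br />c'. Stray < characters that are not part of a complete tag are left untouched here, so the remaining text would still need escaping before display.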

jeroen
A: 

This is a dangerous road to go down. You might think you can write some awesome regexes, or find someone who can help you with them, but sanitizing SOME markup and leaving the rest is just crazy talk.

I highly recommend you look into BBCode or another token system. Even something untokenized, such as what SO uses, is probably a much better solution.
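
To show why a token system is easier to keep safe, here is a minimal Python sketch (again, not ASP.NET). Everything the user typed is escaped first, and only a fixed set of tokens is then translated into HTML. The mapping BBCODE_TO_HTML and the function name bbcode_to_html are assumptions for this example; a real BBCode library supports far more and also deals with mis-nested tokens, which this sketch does not.

```python
import html
import re

# Hypothetical token set: each BBCode token maps to the HTML tag it produces.
BBCODE_TO_HTML = {"b": "strong", "i": "em", "u": "u"}


def bbcode_to_html(comment: str) -> str:
    # 1. Escape everything, so no user-typed HTML survives.
    out = html.escape(comment)
    # 2. Translate only known, properly paired tokens such as [b]...[/b].
    for code, tag in BBCODE_TO_HTML.items():
        pattern = re.compile(
            rf"\[{code}\](.*?)\[/{code}\]", re.IGNORECASE | re.DOTALL
        )
        out = pattern.sub(rf"<{tag}>\1</{tag}>", out)
    # 3. Preserve the line breaks the user typed.
    return out.replace("\r\n", "\n").replace("\n", "<br />")
```

For instance, bbcode_to_html('[b]hi[/b] <script>') returns '<strong>hi</strong> &lt;script&gt;': the token becomes markup, while the raw tag is shown as harmless text.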

Chad Ruppert