I would like to implement a content management system backed by an RDBMS in Java/J2EE, and would like to know the best practices for handling input HTML content.

Below are a few questions I have; I am sure there are many other things to take care of.

  1. Do we need to escape HTML tags and special characters before saving HTML content to the database?
  2. How do we validate or remove invalid special symbols in large input HTML content?
  3. What are the best practices for displaying HTML content from the database back in the browser?
  4. Are there any security risks involved in handling HTML content?

Looking forward to seeing some great ideas from the gurus!

A: 

I am not a guru in this, but I think you will have to figure out how to deal with some special characters and escape sequences, such as quotes (both double and single).

Maybe you can try replacing those special characters and escape sequences with some other characters.
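
One common way to do that replacement is to map the HTML-significant characters to entities when text is written into a page. The sketch below is only an illustration: `HtmlEscaper` and `escape` are made-up names, and it assumes the value is meant to be shown literally as text, not rendered as markup. HTML that the CMS is supposed to render would need cleaning or whitelisting instead of blanket escaping. Escaping for the database is a separate concern and is usually left to parameterized queries.

```java
/** Minimal sketch (hypothetical helper): escape HTML-significant characters at output time. */
final class HtmlEscaper {
    private HtmlEscaper() {}

    /** Returns a copy of input with &, <, >, double and single quotes replaced by entities. */
    static String escape(String input) {
        StringBuilder out = new StringBuilder(input.length() + 16);
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&#39;");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: &lt;a href=&quot;x&quot;&gt;Tom &amp; Jerry&#39;s&lt;/a&gt;
        System.out.println(escape("<a href=\"x\">Tom & Jerry's</a>"));
    }
}
```

Escaping when the page is rendered, and storing the original text unchanged, keeps the stored data reusable for other output formats.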

Maybe someone else who is currently dealing with a CMS might help you out. Anyway, cheers!

Richie
Hi Richie, thanks for the quick reply.
ramrajedotcom
+1  A: 

Use a tool like Neko to clean up the HTML into XHTML, then use any XML parser to parse it.

Sam Barnum
There are some interesting-looking classes in the javax.swing.text.html.parser package that may be useful for parsing messy HTML. http://java.sun.com/javase/6/docs/api/javax/swing/text/html/parser/package-summary.html
Sam Barnum
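
As a rough illustration of the `javax.swing.text.html.parser` package mentioned in the comment above, the sketch below feeds messy HTML to `ParserDelegator` and collects only the text nodes. The class and method names (`HtmlTextExtractor`, `extractText`) are invented for this example, and joining the text chunks with a single space is an arbitrary choice. The parser targets an old HTML dialect, so treat this as a starting point, not a complete sanitizer.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

/** Minimal sketch (hypothetical helper): pull the text out of possibly malformed HTML. */
final class HtmlTextExtractor {
    private HtmlTextExtractor() {}

    static String extractText(String html) {
        final StringBuilder text = new StringBuilder();

        // The callback receives parse events; only text nodes are kept here.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                if (text.length() > 0) {
                    text.append(' ');
                }
                text.append(data);
            }
        };

        try {
            // The third argument tells the parser to ignore any charset declaration in the document.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // The <b> tag is never closed; the parser still reports both text chunks.
        System.out.println(extractText("<p>Hello <b>world</p>"));
    }
}
```

Overriding `handleStartTag` and `handleEndTag` in the same callback would be the natural place to add a tag whitelist.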
+1  A: 

I recently tried out some HTML clean-up libraries, and the best I came across was the Cobra HTML Renderer and Parser, which seems to be faster than the others and also manages to convert dirtier HTML to XHTML. I first went for HTML Tidy, but it ended up complaining about "Unparseable HTML" way too often.

What I'd strongly discourage you from doing is using a regex ;-)
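
To make the regex warning concrete, here is a small sketch (the class name `RegexStripDemo` and the sample input are made up for illustration). A typical "strip anything between angle brackets" pattern stops at the first `>`, even when that character sits inside a quoted attribute value, which is legal HTML.

```java
/** Minimal sketch: a naive regex tag stripper and an input that defeats it. */
final class RegexStripDemo {
    private RegexStripDemo() {}

    /** Removes everything that looks like <...>, with no knowledge of quoting. */
    static String naiveStrip(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        String tricky = "<img alt=\"a>b\" src=\"x.png\">";
        // The match ends at the '>' inside the alt attribute, so part of the tag leaks through.
        // Prints: b" src="x.png">
        System.out.println(naiveStrip(tricky));
    }
}
```

A real parser tracks quoting and nesting, which is why the libraries mentioned in this thread are the safer choice.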

msparer
A: 

I would recommend looking at the architecture and design of an open source CMS like Alfresco or Apache Jackrabbit.

These are actual content repositories and most likely will not include end-to-end integration, but they can show you an underlying data model that is a good place to start.

I would also recommend you check out OWASP for information on web application security and vulnerabilities, and in particular security issues relevant to Java developers.

cwash