I'm fetching data from various RSS / Atom feeds, and sometimes the HTML I receive contains tags that are never closed, or has other problems, which breaks the page layout / styling.

Sometimes there is also a class name / id clash with my own page. Is there any way to sanitize it?

Can anybody point me to a reliable JavaScript / Java implementation?

+1  A: 

You can give JTidy a try.

JTidy can be used as a tool for cleaning up malformed and faulty HTML.

Another option is HtmlCleaner:

HTML found on Web is usually dirty, ill-formed and unsuitable for further processing. For any serious consumption of such documents, it is necessary to first clean up the mess and bring the order to tags, attributes and ordinary text. For the given HTML document, HtmlCleaner reorders individual elements and produces well-formed XML. By default, it follows similar rules that the most of web browsers use in order to create Document Object Model. However, user may provide custom tag and rule set for tag filtering and balancing.
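To make "tag balancing" concrete, here is a toy sketch in plain Java using only the standard library. It is not HtmlCleaner's algorithm or API; the class name `TagBalancer`, the method `balance`, and the short list of void elements are all assumptions made up for illustration. It closes tags that were left open and drops closing tags that have no matching opener, which is roughly the kind of repair the libraries above perform far more thoroughly.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; a real feed should go through
// a proper HTML parser such as the libraries mentioned in this thread.
class TagBalancer {
    // Group 1 is "/" for a closing tag, group 2 is the tag name.
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // Elements that never take a closing tag (deliberately incomplete list).
    private static final Set<String> VOID =
            new HashSet<>(Arrays.asList("br", "hr", "img", "input", "link", "meta"));

    static String balance(String html) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open tags
        Matcher m = TAG.matcher(html);
        int last = 0;
        while (m.find()) {
            out.append(html, last, m.start()); // copy text before this tag
            last = m.end();
            String name = m.group(2).toLowerCase();
            boolean closing = !m.group(1).isEmpty();
            if (!closing) {
                out.append(m.group());
                if (!VOID.contains(name) && !m.group().endsWith("/>")) {
                    open.push(name);
                }
            } else if (open.contains(name)) {
                // Close every tag opened after the matching one, then the match itself.
                String top;
                do {
                    top = open.pop();
                    out.append("</").append(top).append('>');
                } while (!top.equals(name));
            }
            // A closing tag with no matching opener is silently dropped.
        }
        out.append(html.substring(last)); // trailing text after the last tag
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>'); // close leftovers
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(balance("<div><p>Hello <b>world</p></i>"));
        // prints <div><p>Hello <b>world</b></p></div>
    }
}
```

A regex-based pass like this ignores comments, scripts, attribute values containing `>`, and HTML's implicit-close rules, so treat it as an explanation of the idea rather than something to run on real feeds.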

akf
A: 

I have used NekoHTML with great success. It's just a thin layer over the Apache Xerces parser that puts it into error-correcting mode, which is a great architecture: every time Xerces gets better, so does Neko. And there's no huge amount of extra code.

EJP