Hi, (I've seen similar questions, but I don't think any of them covers my specific needs, hence this one.)

I would like to know if there is a Java library for analyzing real-world (read: incomplete, ill-formed) HTML. By analysis, I mean things like:

  • figuring out the most prominent color in an HTML chunk
  • changing that color to some other color (so the library has to support modifying the HTML as well)
  • pruning out unwanted tags
  • fixing up the HTML so that the result is a well-formed HTML snippet

Parts of the last two are handled by libraries such as Jericho and JTidy. 'Plugins' on top of these would be great.

Thanks in advance!

+1  A: 

Take a look at JTidy, a Java port of HTML Tidy. It will, depending on what options you choose, fix non-well-formed HTML and otherwise clean it up.

You'll need something else for the colour-changing part.
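For the colour part, here is a minimal sketch using only the JDK. It rests on some loud assumptions: "most prominent" is taken to mean "most frequently occurring", colours are assumed to be written as 6-digit hex literals (named colours, `rgb()` and 3-digit forms are ignored), and the class and method names (`ColorSketch`, `mostFrequentColor`, `replaceColor`) are made up for illustration. It works on the raw text, so it does not know how much of the rendered page a colour actually covers.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of JTidy or any other library.
final class ColorSketch {
    // Assumption: colours appear as 6-digit hex literals such as #FF0000.
    private static final Pattern HEX = Pattern.compile("#([0-9a-fA-F]{6})\\b");

    // Returns the hex colour that occurs most often (lower-cased), or null if none is found.
    static String mostFrequentColor(String html) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        Matcher m = HEX.matcher(html);
        while (m.find()) {
            counts.merge(m.group().toLowerCase(Locale.ROOT), 1, Integer::sum);
        }
        String best = null;
        int bestCount = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            // Strictly greater: on a tie, the colour seen first wins.
            if (e.getValue() > bestCount) {
                best = e.getKey();
                bestCount = e.getValue();
            }
        }
        return best;
    }

    // Replaces every occurrence of one hex colour with another, ignoring case.
    static String replaceColor(String html, String from, String to) {
        return Pattern.compile(Pattern.quote(from) + "\\b", Pattern.CASE_INSENSITIVE)
                .matcher(html)
                .replaceAll(Matcher.quoteReplacement(to));
    }

    public static void main(String[] args) {
        String html = "<p style=\"color:#FF0000\">a</p><font color=\"#ff0000\">b</font>";
        String top = mostFrequentColor(html);
        System.out.println(top);
        System.out.println(replaceColor(html, top, "#0000ff"));
    }
}
```

Being plain text matching, this also works on ill-formed input, but it will happily count something like a numeric entity that happens to look like a hex colour. Running it on the tidied output and restricting it to `style`, `color` and `bgcolor` attributes would be more robust.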

cletus
Thanks. I'm aware of JTidy. I was looking for something that can do more semantic analysis on an HTML fragment.
Raj
+2  A: 

Well, I would first tidy it into valid XML, then use XSLT to do a conditional deep copy, applying whatever most-prominent-color, pruning or other processing you need along the way.
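A rough sketch of that second step, using the XSLT 1.0 processor that ships with the JDK. The assumptions: the input has already been tidied into well-formed XML with no namespace on its elements, `script` is the tag to drop, `font` is a tag to unwrap while keeping its children, and `#ff0000` to `#0000ff` on `bgcolor` is the colour change. The class name `XsltPruneSketch` and these choices are purely illustrative.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: identity transform plus a few overriding templates.
final class XsltPruneSketch {
    private static final String STYLESHEET =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        // Identity template: deep-copies everything not matched below.
      + "<xsl:template match='@*|node()'>"
      + "<xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
      + "</xsl:template>"
        // Drop script elements together with their content.
      + "<xsl:template match='script'/>"
        // Unwrap font elements: keep the children, lose the tag.
      + "<xsl:template match='font'><xsl:apply-templates select='node()'/></xsl:template>"
        // Rewrite one colour value (the comparison is case-sensitive).
      + "<xsl:template match='@bgcolor[.=\"#ff0000\"]'>"
      + "<xsl:attribute name='bgcolor'>#0000ff</xsl:attribute>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Applies the stylesheet to well-formed XML and returns the result as a string.
    static String prune(String wellFormedXml) {
        try {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            t.setOutputProperty(OutputKeys.METHOD, "xml");
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(wellFormedXml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<div bgcolor=\"#ff0000\"><script>x()</script>"
                   + "<font color=\"red\">hi</font><p>ok</p></div>";
        System.out.println(prune(xml));
    }
}
```

If the tidying step emits XHTML in the `http://www.w3.org/1999/xhtml` namespace, the match patterns would need a namespace prefix declared in the stylesheet, otherwise `script` and `font` will not match.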

igor
+3  A: 

You might want to check out TagSoup:

http://home.ccil.org/~cowan/XML/tagsoup/

desau
I'll look into this, thanks!
Raj
None of the libraries offers much semantic analysis, but I voted for this because TagSoup is really impressive nevertheless.
Raj
A: 

Maybe you will find something in this list (try TagSoup, NekoHTML, VietSpider HTMLParser).

dma_k