Hi,

I am looking for a Java HTML parser similar to Simple HTML DOM in PHP. I know Java well but my PHP is poor, so I would like to use a Java library to understand how an HTML parser works.

thanks

+1  A: 

Try TagSoup, a SAX parser that takes in real-world messy HTML and triggers SAX XML events on your ContentHandler. I recommend using this with JDOM to build a JDOM Document that you can walk manually, or via XPath.

Harold L
+2  A: 

Try jsoup. It is the best HTML parser as far as I know.

chedine
@Yosef: my suggestion, if you want to understand how the parser works, is to download the source jar and step through a parse in your IDE's debugger. Specifically, see the parse() method in http://github.com/jhy/jsoup/blob/master/src/main/java/org/jsoup/parser/Parser.java. It is a recursive descent parser that looks at the next characters in the HTML queue and, depending on the current context, will create child elements or text data, or pop up the element stack when a close tag is found. The simplicity of the parse is somewhat complicated by handling dodgy input HTML. (I'm the author of jsoup)
Jonathan Hedley
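
To make the idea in the comment above concrete, here is a minimal sketch of a stack-based HTML parse using only the Java standard library. It is not jsoup's code: `ToyHtmlParser`, `Node`, `parse` and `render` are hypothetical names made up for illustration, and it uses a simple loop with an explicit element stack rather than jsoup's actual parser. It assumes attribute values never contain `>`, and it does not handle comments, doctypes, entities or implied end tags.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class ToyHtmlParser {
    /** Minimal tree node: an element (tag != null) or a text node (tag == null). */
    static class Node {
        final String tag;
        final String text;
        final List<Node> children = new ArrayList<>();
        Node(String tag, String text) { this.tag = tag; this.text = text; }
    }

    static Node parse(String html) {
        Node root = new Node("#root", null);
        Deque<Node> stack = new ArrayDeque<>(); // currently open elements, innermost on top
        stack.push(root);
        int pos = 0;
        while (pos < html.length()) {
            int lt = html.indexOf('<', pos);
            if (lt != pos) {
                // Text data: everything up to the next tag (or the end of input).
                int end = (lt < 0) ? html.length() : lt;
                String text = html.substring(pos, end);
                if (!text.trim().isEmpty()) {
                    stack.peek().children.add(new Node(null, text));
                }
                pos = end;
                continue;
            }
            int gt = html.indexOf('>', lt);
            if (gt < 0) {
                // Unterminated tag: keep the remainder as text and stop.
                stack.peek().children.add(new Node(null, html.substring(lt)));
                break;
            }
            String inside = html.substring(lt + 1, gt).trim();
            pos = gt + 1;
            if (inside.startsWith("/")) {
                // Close tag: pop up the stack to the nearest matching open element.
                String name = inside.substring(1).trim().toLowerCase();
                boolean isOpen = false;
                for (Node n : stack) {
                    if (n != root && n.tag.equals(name)) { isOpen = true; break; }
                }
                if (isOpen) {
                    while (!stack.pop().tag.equals(name)) {
                        // discard unclosed inner elements from the stack
                    }
                } // a stray close tag with no matching open element is ignored
            } else if (!inside.isEmpty()) {
                // Open tag: create a child element; attributes are skipped in this sketch.
                boolean selfClosing = inside.endsWith("/");
                String name = inside.split("[\\s/]+")[0].toLowerCase();
                Node element = new Node(name, null);
                stack.peek().children.add(element);
                if (!selfClosing) {
                    stack.push(element);
                }
            }
        }
        return root;
    }

    /** Renders the tree in a compact form such as div(p("text") br). */
    static String render(Node n) {
        if (n.tag == null) {
            return "\"" + n.text + "\"";
        }
        StringBuilder sb = new StringBuilder(n.tag);
        if (!n.children.isEmpty()) {
            sb.append('(');
            for (int i = 0; i < n.children.size(); i++) {
                if (i > 0) sb.append(' ');
                sb.append(render(n.children.get(i)));
            }
            sb.append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(render(parse("<div><p>Hello <b>world</b></p><br/></div>")));
    }
}
```

Stepping through `parse` in a debugger with a small input shows the three cases from the comment: text becomes a child of the element on top of the stack, an open tag creates a child and pushes it, and a close tag pops the stack. A real parser such as jsoup has to do much more work to cope with messy input.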