I want to parse some HTML in order to find the values of some attributes/tags etc.

What HTML parsers do you recommend? Any pros and cons?

+6  A: 

I have tried HTML Parser, which is dead simple.

pek
I have used HTML Parser on a project and it worked exactly as expected.
Craig Angus
But there are not many tutorials available...
Lily
+1  A: 

Do you need to do a full parse of the HTML? If you're just looking for specific values within the contents (a specific tag/param), then a simple regular expression might be enough, and could very well be faster.
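As a rough sketch of what the regex route might look like in Java, the class below pulls out attribute values and element text with `java.util.regex`. The class and method names (`RegexHtmlSketch`, `attrValues`, `textOf`) are made up for illustration, and the patterns assume simple input: double-quoted attribute values, no nested tags of the same name, and no `>` inside attribute values. Real-world HTML will break these assumptions quickly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexHtmlSketch {

    // Hypothetical helper: values of the double-quoted attribute `attr`
    // on every opening <tag ...> element.
    static List<String> attrValues(String html, String tag, String attr) {
        Pattern p = Pattern.compile(
                "<" + Pattern.quote(tag) + "\\b[^>]*?\\s" + Pattern.quote(attr)
                        + "\\s*=\\s*\"([^\"]*)\"",
                Pattern.CASE_INSENSITIVE);
        return collectGroup1(p, html);
    }

    // Hypothetical helper: plain text between <tag ...> and </tag>.
    // Assumes the element contains no child tags.
    static List<String> textOf(String html, String tag) {
        Pattern p = Pattern.compile(
                "<" + Pattern.quote(tag) + "\\b[^>]*>([^<]*)</" + Pattern.quote(tag) + "\\s*>",
                Pattern.CASE_INSENSITIVE);
        return collectGroup1(p, html);
    }

    // Collects capture group 1 of every match, in document order.
    private static List<String> collectGroup1(Pattern p, String input) {
        Matcher m = p.matcher(input);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        String html = "<tag>here</tag>\n<tag attr=\"here\">test</tag>\n<here>test</here>";
        System.out.println(attrValues(html, "tag", "attr"));
        System.out.println(textOf(html, "tag"));
    }
}
```

Each kind of lookup needs its own pattern, which is a sign of where this approach stops scaling: once the markup gets irregular, a real parser is likely the safer choice.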

Herms
A: 

@Herms yes, I agree that regex is good for simple text finding. Can you provide an example that will find the word "here" in the following text?

<tag>here</tag>
<tag attr="here">test</tag>
<here>test</here>

Also, just so this question includes everything, I would like to hear some library recommendations as well.

pek
+10  A: 

NekoHTML, TagSoup, and JTidy will allow you to parse HTML and then process with XML tools, like XPath.
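To give an idea of the "process with XPath" half, here is a minimal sketch that uses only the JDK's own `javax.xml.xpath` API. It does not call any of the three libraries: the JDK parser used here is strict, so the input is assumed to be already well-formed, standing in for the DOM that a tolerant parser would hand you. The class name `XPathSketch` and the helper `select` are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class XPathSketch {

    // Hypothetical helper: evaluates an XPath expression against well-formed
    // markup and returns the text content of every matching node.
    static List<String> select(String xml, String expr) {
        try {
            // Assumption: input is well-formed. For messy HTML, the Document
            // would come from a tolerant parser instead of this strict one.
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(expr, doc, XPathConstants.NODESET);

            List<String> out = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                // For attribute nodes this is the attribute value,
                // for elements it is the concatenated text inside them.
                out.add(nodes.item(i).getTextContent());
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("Could not evaluate: " + expr, e);
        }
    }

    public static void main(String[] args) {
        String xml = "<root><tag>here</tag><tag attr=\"here\">test</tag><here>test</here></root>";
        System.out.println(select(xml, "//tag/@attr"));          // attribute value
        System.out.println(select(xml, "//tag[text()='here']")); // element by its text
        System.out.println(select(xml, "//here"));               // element by its name
    }
}
```

The three expressions distinguish a value that appears as an attribute, as element text, or as an element name, which is the kind of distinction that is awkward to express with a single regular expression.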

jelovirt
XPath is the way to go for HTML parsing; it also helps with badly formed HTML, where regex fails.
Sumit Ghosh
A: 

I am new to HTML parsing. I know Java and HTML pretty well. I have come to understand that HTMLParser is an easy tool to work with, but there are limited resources available for learning and using it. Can anyone suggest where to start?

TAM
You should start a new question and reference this one.
Markus Lausberg