ansaurus

Question

Best way to extract elements from an HTML page?

Answer 1

+4 A:

I'd use a library like HTML Parser for this job. Have a look at the samples and/or the javadoc. Also have a look at previous questions here on SO.

HTML Parser is pretty easy to use and should do the job. For alternatives, have a look at this previous answer.

Pascal Thivent 2010-01-06 22:54:35

Is it different from HtmlUnit? It looks similar.

mrblah 2010-01-06 23:00:39

HtmlUnit is a testing tool. HTML Parser is... a parser. So yes, they are different.

Pascal Thivent 2010-01-06 23:02:20

True, HtmlUnit does have parser-type methods, but I get your point!

mrblah 2010-01-06 23:09:43

Well, HtmlUnit does indeed need to parse HTML to make assertions on it, but the suggested tools let you do advanced manipulations, clean up crappy HTML, etc. Just have a look at the API, you'll see. They really have different purposes.

Pascal Thivent 2010-01-06 23:13:08

Say you have an HTML page. How could you get a collection of the HTML above (see question)? I have maybe 10-20 <tr></tr> sets in my HTML. How would I get those with HTML Parser?

mrblah 2010-01-06 23:18:11

You could use a filter or a visitor (as documented on its website). Have a look at the javadoc of NodeVisitor, for example (http://htmlparser.sourceforge.net/javadoc/org/htmlparser/visitors/NodeVisitor.html), and try it. Also have a look at the samples (http://htmlparser.sourceforge.net/samples.html).

Pascal Thivent 2010-01-06 23:38:34
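
To make the visitor idea concrete, here is a minimal sketch that collects every <tr> row and the text of its cells. Note that it does not use HTML Parser: it swaps in the callback parser that ships with the JDK (javax.swing.text.html), which follows the same "get called once per tag" pattern, so it runs without extra jars. The class name RowVisitor and the method extractRows are made up for illustration, and real pages with nested tables would need more care.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: a visitor-style callback that gathers table rows.
// Assumes flat (non-nested) tables; this is the JDK parser, not HTML Parser.
final class RowVisitor extends HTMLEditorKit.ParserCallback {
    private final List<List<String>> rows = new ArrayList<>();
    private List<String> currentRow;   // non-null while inside a <tr>
    private StringBuilder currentCell; // non-null while inside a <td>/<th>

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
        if (tag == HTML.Tag.TR) {
            currentRow = new ArrayList<>();
        } else if (tag == HTML.Tag.TD || tag == HTML.Tag.TH) {
            currentCell = new StringBuilder();
        }
    }

    @Override
    public void handleText(char[] data, int pos) {
        // Only keep text that appears inside a cell.
        if (currentCell != null) {
            currentCell.append(data);
        }
    }

    @Override
    public void handleEndTag(HTML.Tag tag, int pos) {
        if ((tag == HTML.Tag.TD || tag == HTML.Tag.TH) && currentCell != null) {
            if (currentRow != null) {
                currentRow.add(currentCell.toString().trim());
            }
            currentCell = null;
        } else if (tag == HTML.Tag.TR && currentRow != null) {
            rows.add(currentRow);
            currentRow = null;
        }
    }

    // Parses the given HTML and returns one list of cell texts per <tr>.
    static List<List<String>> extractRows(String html) throws IOException {
        RowVisitor visitor = new RowVisitor();
        // 'true' tells the parser to ignore any charset declared in the page.
        new ParserDelegator().parse(new StringReader(html), visitor, true);
        return visitor.rows;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><table>"
                + "<tr><td>A</td><td>B</td></tr>"
                + "<tr><td>C</td></tr>"
                + "</table></body></html>";
        for (List<String> row : extractRows(html)) {
            System.out.println(row);
        }
    }
}
```

With HTML Parser itself, the equivalent would be a NodeVisitor subclass or a tag-name filter, as described in the javadoc linked above.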

Answer 2

+3 A:

JTidy does an excellent job of parsing HTML and making it available for manipulation as a DOM. Regular expressions are generally not the way to go, since HTML isn't a regular language and has numerous edge cases to trip you up.

Brian Agnew 2010-01-06 22:56:49
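
Once the page is available as a DOM, pulling out the rows is a short loop. The sketch below assumes the markup has already been cleaned into well-formed XHTML (for instance by a tool such as JTidy) and uses only the standard JDK XML parser, so it shows the DOM side and not JTidy's own API. The class name DomRows and the method rowTexts are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: reads <tr> elements from an already well-formed document.
final class DomRows {

    // Returns the concatenated text of each <tr>, in document order.
    // Assumes the input is well-formed XHTML without a DOCTYPE.
    static List<String> rowTexts(String xhtml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

        NodeList rows = doc.getElementsByTagName("tr");
        List<String> texts = new ArrayList<>();
        for (int i = 0; i < rows.getLength(); i++) {
            // getTextContent() joins the text of all descendant nodes.
            texts.add(rows.item(i).getTextContent().trim());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><table>"
                + "<tr><td>A</td><td>B</td></tr>"
                + "<tr><td>C</td></tr>"
                + "</table></body></html>";
        for (String text : rowTexts(xhtml)) {
            System.out.println(text);
        }
    }
}
```

A strict XML parser like this one rejects unclosed or badly nested tags, which is why a cleanup step is needed first for real-world HTML.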

Man, with Java you have SO many options, it's crazy!

mrblah 2010-01-06 22:59:56
