What is the preferred way to extract elements from an HTML page in Java?

My HTML has many rows like the following:

<tr class="item-odd">
       <td class="data"><a href="http://.....">TITLE</a></td>
       <td><div class="cost">$1.99</div></td>
</tr>

The class alternates item-odd and item-even.

I need to extract:

  1. URL
  2. Title
  3. Price

Are regular expressions the way to go?

+4  A: 

I'd use a library like HTML Parser for this job. Have a look at the samples and/or the javadoc. Also have a look at previous questions here on SO.

HTML Parser is pretty easy to use and should do the job. For alternatives, have a look at this previous answer.

Pascal Thivent
Is it different from HtmlUnit? It looks similar.
mrblah
HtmlUnit is a testing tool. HTML Parser is... a parser. So yes, they are different.
Pascal Thivent
True, although HtmlUnit does have parser-type methods. I get your point, though!
mrblah
Well, HtmlUnit does indeed need to parse HTML in order to make assertions on it, but the suggested tools also let you do advanced manipulation, clean up crappy HTML, etc. Just have a look at the API and you'll see. They really serve different purposes.
Pascal Thivent
Say you have an HTML page: how could you get a collection of the rows above (see question)? I have maybe 10-20 <tr></tr> sets in my HTML; how would I get those with HTML Parser?
mrblah
You could use a filter or a visitor (as documented on its website). Have a look at the javadoc of NodeVisitor, for example (http://htmlparser.sourceforge.net/javadoc/org/htmlparser/visitors/NodeVisitor.html), and try it. Also have a look at the samples (http://htmlparser.sourceforge.net/samples.html).
Pascal Thivent
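To give a feel for the visitor idea mentioned above, here is a rough sketch that walks the rows from the question with a callback. Note that it does not use HTML Parser's NodeVisitor: to stay dependency-free it swaps in the callback parser that ships with the JDK (javax.swing.text.html), which follows a similar "get notified per tag" pattern. The class name RowCallbackSketch, the Item holder and the example.com URLs are made up for illustration, and the rows are assumed to sit inside a normal table.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class RowCallbackSketch {

    /** One extracted row; this holder and its field names are illustrative. */
    static final class Item {
        String url = "";
        String title = "";
        String price = "";
    }

    /** Collects url/title/price from every tr whose class starts with "item-". */
    static List<Item> extract(String html) {
        List<Item> items = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private Item current;    // row being read, or null when outside a row
            private boolean inLink;  // true between <a> and </a>
            private boolean inCost;  // true between <div class="cost"> and </div>

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                Object cls = attrs.getAttribute(HTML.Attribute.CLASS);
                if (tag == HTML.Tag.TR && cls != null && cls.toString().startsWith("item-")) {
                    current = new Item();
                } else if (tag == HTML.Tag.A && current != null) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    current.url = href == null ? "" : href.toString();
                    inLink = true;
                } else if (tag == HTML.Tag.DIV && current != null && "cost".equals(cls)) {
                    inCost = true;
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (current == null) return;
                if (inLink) current.title += new String(data);
                else if (inCost) current.price += new String(data);
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (tag == HTML.Tag.A) {
                    inLink = false;
                } else if (tag == HTML.Tag.DIV) {
                    inCost = false;
                } else if (tag == HTML.Tag.TR && current != null) {
                    current.title = current.title.trim();
                    current.price = current.price.trim();
                    items.add(current);
                    current = null;
                }
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return items;
    }

    public static void main(String[] args) {
        // Sample input modelled on the rows in the question; URLs are placeholders.
        String html = "<html><body><table>"
                + "<tr class=\"item-odd\"><td class=\"data\"><a href=\"http://example.com/a\">First</a></td>"
                + "<td><div class=\"cost\">$1.99</div></td></tr>"
                + "<tr class=\"item-even\"><td class=\"data\"><a href=\"http://example.com/b\">Second</a></td>"
                + "<td><div class=\"cost\">$2.50</div></td></tr>"
                + "</table></body></html>";
        for (Item it : extract(html)) {
            System.out.println(it.url + " | " + it.title + " | " + it.price);
        }
    }
}
```

The same state-machine shape (open a row on tr, record the link and the cost div, close the row on the end tag) should carry over to a NodeVisitor subclass, but check the HTML Parser javadoc for the exact method names before relying on that.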
+3  A: 

JTidy does an excellent job of parsing HTML and making it available for manipulation as a DOM. Regular expressions are generally not the way to go, since HTML isn't a regular language and has numerous edge cases to trip you up.

Brian Agnew
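Once the page is available as a DOM, pulling out the three fields is mostly a matter of walking org.w3c.dom nodes. The sketch below shows that walk under some assumptions: the Document would normally come from JTidy (as far as I know, via Tidy.parseDOM), but here it is built with the JDK's own XML parser from a well-formed snippet so the example runs without extra jars. DomRowSketch, parseWellFormed and the example.com URLs are hypothetical names for illustration. It deliberately sticks to basic DOM calls (getElementsByTagName, getAttribute, getNodeValue), since not every DOM implementation supports the newer convenience methods.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomRowSketch {

    /** Returns {url, title, price} for every tr whose class starts with "item-". */
    static List<String[]> extract(Document doc) {
        List<String[]> rows = new ArrayList<>();
        NodeList trs = doc.getElementsByTagName("tr");
        for (int i = 0; i < trs.getLength(); i++) {
            Element tr = (Element) trs.item(i);
            String cls = tr.getAttribute("class");
            if (cls == null || !cls.startsWith("item-")) continue;

            NodeList links = tr.getElementsByTagName("a");
            if (links.getLength() == 0) continue; // row without a link: skip it
            Element link = (Element) links.item(0);

            String price = "";
            NodeList divs = tr.getElementsByTagName("div");
            for (int j = 0; j < divs.getLength(); j++) {
                Element div = (Element) divs.item(j);
                if ("cost".equals(div.getAttribute("class"))) price = text(div);
            }
            rows.add(new String[] { link.getAttribute("href"), text(link), price });
        }
        return rows;
    }

    /** Concatenates the direct text children of a node and trims the result. */
    static String text(Node node) {
        StringBuilder sb = new StringBuilder();
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) sb.append(c.getNodeValue());
        }
        return sb.toString().trim();
    }

    /** Stand-in for a tidying parser: only accepts markup that is already well-formed. */
    static Document parseWellFormed(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed", e);
        }
    }

    public static void main(String[] args) {
        // Sample input modelled on the rows in the question; URLs are placeholders.
        String xhtml = "<table>"
                + "<tr class=\"item-odd\"><td class=\"data\"><a href=\"http://example.com/a\">First</a></td>"
                + "<td><div class=\"cost\">$1.99</div></td></tr>"
                + "<tr class=\"item-even\"><td class=\"data\"><a href=\"http://example.com/b\">Second</a></td>"
                + "<td><div class=\"cost\">$2.50</div></td></tr>"
                + "</table>";
        for (String[] row : extract(parseWellFormed(xhtml))) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Real pages are rarely well-formed, which is exactly the gap a tidying parser fills; the extract method only cares that it receives a Document, so the parsing step can be swapped out.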
Man, with Java you have SO many options, it's crazy!
mrblah