ansaurus

Question

Read an HTML table into Java

Answer 1

+2 A:

Use an HTML parser such as CyberNeko.

Damo 2009-08-17 17:03:03

Answer 2

+3 A:

There is a nice HTML parser called Neko:

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.

More information here.

dfa 2009-08-17 17:03:24

Answer 3

A:

HTML scraping is notoriously difficult, unless you have a lot of "hooks" like unique IDs. For example, the table you want starts with this HTML:

<table cellspacing="3" cellpadding="2" border="0" width="670">

...which is very generic and may match several tables on the page. The other problem is, what happens if the HTML structure changes? You'll have to redefine all your parsing rules...

DisgruntledGoat 2009-08-17 17:13:26

Excellent point, but it sounds like this is homework, so it won't matter if it changes later. ;]

CPerkins 2009-08-17 20:25:55

Answer 4

+1 A:

J2SE includes HTML parsing capabilities in the packages javax.swing.text.html and javax.swing.text.html.parser. HTMLEditorKit.ParserCallback receives events pushed by DocumentParser (which is best used through ParserDelegator). The framework is very similar to the SAX parsers for XML.

Beware, there are some bugs, and it does not handle malformed HTML very well.

Handling colspan and rowspan is up to you.
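
As a rough illustration of the callback approach described above, here is a minimal sketch that collects cell text row by row. The class name HtmlTableReader and the method readRows are made up for this example. It assumes reasonably well-formed markup, ignores colspan and rowspan, and makes no attempt to tell tables apart, so the rows of every table on the page (including nested ones) end up in one list.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, not part of the JDK.
public class HtmlTableReader {

    /** Collects the text of every td/th cell, grouped by tr row. */
    static class TableCallback extends HTMLEditorKit.ParserCallback {
        final List<List<String>> rows = new ArrayList<>();
        private List<String> currentRow;    // non-null while inside a <tr>
        private StringBuilder currentCell;  // non-null while inside a <td>/<th>

        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
            if (tag == HTML.Tag.TR) {
                currentRow = new ArrayList<>();
            } else if (tag == HTML.Tag.TD || tag == HTML.Tag.TH) {
                currentCell = new StringBuilder();
            }
        }

        @Override
        public void handleText(char[] data, int pos) {
            // Text may arrive in several chunks (e.g. around inline tags).
            if (currentCell != null) {
                currentCell.append(data);
            }
        }

        @Override
        public void handleEndTag(HTML.Tag tag, int pos) {
            if ((tag == HTML.Tag.TD || tag == HTML.Tag.TH) && currentCell != null) {
                if (currentRow != null) {
                    currentRow.add(currentCell.toString().trim());
                }
                currentCell = null;
            } else if (tag == HTML.Tag.TR && currentRow != null) {
                rows.add(currentRow);
                currentRow = null;
            }
        }
    }

    /** Parses the HTML and returns one list of cell strings per table row. */
    public static List<List<String>> readRows(Reader html) throws IOException {
        TableCallback callback = new TableCallback();
        // true = ignore any charset declared in the document itself
        new ParserDelegator().parse(html, callback, true);
        return callback.rows;
    }

    public static void main(String[] args) throws IOException {
        String html = "<html><body><table>"
                + "<tr><th>Name</th><th>Qty</th></tr>"
                + "<tr><td>Apple</td><td>3</td></tr>"
                + "</table></body></html>";
        System.out.println(readRows(new StringReader(html)));
    }
}
```

To pick out one particular table, you could inspect the attrs argument in handleStartTag when the tag is HTML.Tag.TABLE and only start collecting once the attributes match the table you want.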

Marian 2009-08-17 23:02:49

Thanks, this looks like a good place to start. And although CyberNeko seems interesting, I was hoping to stay within libraries that we are already using.

aintnoprophet 2009-08-18 14:44:55
