I need to pull data from an HTML page using Java code. The Java part is required.

The page I am trying to pull info from is http://www.weather.gov/data/obhistory/KMCI.html .

I need to create a list of HashMaps, or some other kind of data object that I can reference in later code.

This is all I have so far:

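import java.io.InputStream;
import java.net.URL;

// Fragment: belongs inside a method that declares "throws IOException".
URL weatherDataKC = new URL("http://www.weather.gov/data/obhistory/KMCI.html");
StringBuilder buffer = new StringBuilder();

// try-with-resources closes the stream even if a read fails
try (InputStream is = weatherDataKC.openStream()) {
    int cnt;
    // Read one byte at a time until end of stream (-1)
    while ((cnt = is.read()) != -1) {
        buffer.append((char) cnt);
    }
}

System.out.print(buffer.toString());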

Any suggestions where to start?

+2  A: 

Use an HTML parser like CyberNeko.

Damo
+3  A: 

There is a nice HTML parser called Neko:

NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.

More information here.

dfa
A: 

HTML scraping is notoriously difficult, unless you have a lot of "hooks" like unique IDs. For example, the table you want starts with this HTML:

<table cellspacing="3" cellpadding="2" border="0" width="670">

...which is very generic and may match several tables on the page. The other problem is, what happens if the HTML structure changes? You'll have to redefine all your parsing rules...
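To make the ambiguity concrete, here is a rough sketch that counts how many tables in a document carry a given width attribute. It assumes the JDK's built-in Swing HTML parser; the class name TableHookCheck and the method countTablesWithWidth are made up for illustration, and the input is whatever HTML string you pass in, not the real weather page.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: checks whether an attribute value is a usable "hook".
class TableHookCheck {

    // Counts <table> start tags whose width attribute equals the given value.
    static int countTablesWithWidth(String html, final String width) throws IOException {
        final int[] count = {0};
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                Object value = attrs.getAttribute(HTML.Attribute.WIDTH);
                if (tag == HTML.Tag.TABLE && width.equals(String.valueOf(value))) {
                    count[0]++;
                }
            }
        };
        // true = ignore any charset declared in the document
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return count[0];
    }
}
```

If this returns more than 1 for the page you are scraping, the attribute alone is not enough to identify the table, and you need another hook such as the table's position or the text of its header cells.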

DisgruntledGoat
Excellent point, but it sounds like this is homework, so it won't matter if it changes later. ;]
CPerkins
+1  A: 

J2SE includes HTML parsing capabilities in the packages javax.swing.text.html and javax.swing.text.html.parser. HTMLEditorKit.ParserCallback receives events pushed by DocumentParser (which is best used through ParserDelegator). The framework is very similar to the SAX parsers for XML.

Beware, there are some bugs, and it won't handle bad HTML very well.

Dealing with colspan and rowspan is up to you.
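A minimal sketch of the callback approach, assuming a simple table with one header row and no colspan, rowspan, or nested tables. The class name TableScraper and the methods parseRows and toMaps are invented for this example; only ParserDelegator, HTMLEditorKit.ParserCallback, and HTML.Tag come from the JDK.

```java
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: collects table cells as the parser pushes events.
class TableScraper {

    // Returns every <tr> in the document as a list of trimmed cell texts.
    static List<List<String>> parseRows(Reader reader) throws IOException {
        final List<List<String>> rows = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            private List<String> currentRow; // non-null while inside a <tr>
            private StringBuilder cell;      // non-null while inside a <td>/<th>

            private boolean isCell(HTML.Tag tag) {
                return tag == HTML.Tag.TD || tag == HTML.Tag.TH;
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag == HTML.Tag.TR) {
                    currentRow = new ArrayList<>();
                } else if (isCell(tag)) {
                    cell = new StringBuilder();
                }
            }

            @Override
            public void handleText(char[] data, int pos) {
                if (cell != null) {
                    cell.append(data);
                }
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                if (isCell(tag) && cell != null) {
                    if (currentRow != null) {
                        currentRow.add(cell.toString().trim());
                    }
                    cell = null;
                } else if (tag == HTML.Tag.TR && currentRow != null) {
                    rows.add(currentRow);
                    currentRow = null;
                }
            }
        };
        // true = ignore any charset declared in the document
        new ParserDelegator().parse(reader, callback, true);
        return rows;
    }

    // Assumption: the first row holds the column names; later rows are data.
    static List<Map<String, String>> toMaps(List<List<String>> rows) {
        List<Map<String, String>> result = new ArrayList<>();
        if (rows.isEmpty()) {
            return result;
        }
        List<String> header = rows.get(0);
        for (List<String> row : rows.subList(1, rows.size())) {
            Map<String, String> map = new LinkedHashMap<>();
            int n = Math.min(header.size(), row.size());
            for (int i = 0; i < n; i++) {
                map.put(header.get(i), row.get(i));
            }
            result.add(map);
        }
        return result;
    }
}
```

To use it on the live page you could pass new InputStreamReader(url.openStream()) to parseRows and then feed the rows to toMaps. Since parseRows collects rows from every table in the document, you would still have to pick out the rows you care about and work out the multi-row headers yourself.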

Marian
Thanks, this looks like a good place to start. And though CyberNeko seems interesting, I was hoping to stay within libraries that we are already using.
aintnoprophet