I'm planning to write a simple J2SE application to aggregate information from multiple web sources.

The most difficult part, I think, is extraction of meaningful information from web pages, if it isn't available as RSS or Atom feeds. For example, I might want to extract a list of questions from stackoverflow, but I absolutely don't need that huge tag cloud or navbar.

What technique or library would you advise?

Updates/Remarks

  • Speed doesn't matter — as long as it can parse about 5MB of HTML in less than 10 minutes.
  • It should be really simple.
A: 

In short, you can either parse the whole page and pick out what you need (for speed, I recommend looking at SAXParser), or run the HTML through a regexp that trims off all of the markup. You can also convert it all into a DOM, but that is going to be expensive, especially if you are aiming for decent throughput.
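
As a rough illustration of the regexp route, here is a minimal sketch using only java.util.regex. The class name TagStripper is made up for this example, and the patterns are a simplification: they do not decode entities such as &amp;, and they will not cope with every malformed page.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of any library.
class TagStripper {
    // <script> and <style> elements are dropped together with their bodies.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b.*?</\\1\\s*>");
    // HTML comments, possibly spanning several lines.
    private static final Pattern COMMENT = Pattern.compile("(?s)<!--.*?-->");
    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    /** Returns the visible text of the page with runs of whitespace collapsed. */
    static String strip(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = COMMENT.matcher(text).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }
}
```

Note that this keeps the navbar text along with everything else, so it only helps when you want all of the text on the page.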

A: 

You seem to want to screen scrape. You would probably want to write a framework with an adapter/plugin per source site (as each site's format will differ) that parses the HTML source and extracts the text. You would probably use Java's I/O API to connect to the URL and stream the data via InputStreams.
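
One possible shape for such a framework, as a sketch only: every name below (SiteAdapter, HeadingAdapter, Aggregator) is invented for this example, the "items are in h3 elements" rule is an assumed page layout, and UTF-8 is an assumed encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// One adapter per source site; all names here are hypothetical.
interface SiteAdapter {
    boolean handles(String url);
    List<String> extract(String html);
}

// Example adapter: assumes the interesting items are the <h3> headings.
class HeadingAdapter implements SiteAdapter {
    private static final Pattern H3 = Pattern.compile("(?is)<h3[^>]*>(.*?)</h3>");
    private final String hostFragment;

    HeadingAdapter(String hostFragment) {
        this.hostFragment = hostFragment;
    }

    public boolean handles(String url) {
        return url.contains(hostFragment);
    }

    public List<String> extract(String html) {
        List<String> items = new ArrayList<>();
        Matcher m = H3.matcher(html);
        while (m.find()) {
            // Remove nested tags such as <a> and keep only the text.
            items.add(m.group(1).replaceAll("<[^>]*>", "").trim());
        }
        return items;
    }
}

class Aggregator {
    private final List<SiteAdapter> adapters = new ArrayList<>();

    void register(SiteAdapter adapter) {
        adapters.add(adapter);
    }

    // Reads the whole page into a string (UTF-8 is assumed here).
    static String fetch(String url) throws IOException {
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                URI.create(url).toURL().openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = in.readLine()) != null) {
                page.append(line).append('\n');
            }
        }
        return page.toString();
    }

    // Hands the page to the first adapter that claims the URL.
    List<String> extract(String url, String html) {
        for (SiteAdapter adapter : adapters) {
            if (adapter.handles(url)) {
                return adapter.extract(html);
            }
        }
        return new ArrayList<>();
    }
}
```

Keeping fetch separate from extract means each adapter can be tried against a saved copy of the page without touching the network.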

+2  A: 

You could look at how HttpUnit does it. It uses a couple of decent HTML parsers, one of which is NekoHTML. As for getting the data, you can use what's built into the JDK (HttpURLConnection) or Apache's HttpClient:

http://hc.apache.org/httpclient-3.x/

James Law
Neko sounds good — I like cats :)
AnSGri
+3  A: 

You may use HTMLParser (http://htmlparser.sourceforge.net/) in combination with URL#openStream() to parse the content of HTML pages hosted on the Internet.

jatanp
A: 

If you want to do it the old-fashioned way, you need to open a socket to the web server's port and then send the following data:

GET /file.html HTTP/1.0
Host: site.com
<ENTER>
<ENTER>

Then use Socket#getInputStream, read the data using a BufferedReader, and parse it using whatever you like.
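
In Java this could look roughly like the sketch below. RawHttpGet is a made-up class name. Each request line ends with CRLF and an empty line closes the header block. With HTTP/1.0 the server closes the connection after responding, so reading until end of stream is enough. The ISO-8859-1 decoding is an assumption; a real client would look at the Content-Type header.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
class RawHttpGet {
    // Builds the request text: request line, Host header, then an empty line.
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.0\r\n"
             + "Host: " + host + "\r\n"
             + "\r\n";
    }

    // Returns the raw response: status line, headers, blank line, body.
    static String fetch(String host, int port, String path) throws IOException {
        try (Socket socket = new Socket(host, port)) {
            OutputStream out = socket.getOutputStream();
            out.write(buildRequest(host, path).getBytes(StandardCharsets.US_ASCII));
            out.flush();

            BufferedReader in = new BufferedReader(new InputStreamReader(
                    socket.getInputStream(), StandardCharsets.ISO_8859_1));
            StringBuilder response = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                response.append(line).append('\n');
            }
            return response.toString();
        }
    }
}
```

A call such as RawHttpGet.fetch("site.com", 80, "/file.html") would send the request shown above. You still have to split the headers from the body yourself and handle redirects, which is what HttpURLConnection does for you.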

Vhaerun
Is it really the best way to get a page?
AnSGri
+2  A: 

If you want to take advantage of any structural or semantic markup, you might want to explore converting the HTML to XML and using XQuery to extract the information in a standard form. Take a look at this IBM developerWorks article for some typical code, excerpted below (they're outputting HTML, which is, of course, not required):

<table>
{
  for $d in //td[contains(a/small/text(), "New York, NY")]
  for $row in $d/parent::tr/parent::table/tr
  where contains($d/a/small/text()[1], "New York")
  return <tr><td>{data($row/td[1])}</td> 
           <td>{data($row/td[2])}</td>              
           <td>{$row/td[3]//img}</td> </tr>
}
</table>
Joe Liversedge
A: 

You can use NekoHTML to parse your HTML document. You will get a DOM document, and you can then use XPath to retrieve the data you need.
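
The XPath half can be tried with the JDK alone. In the sketch below the JDK's own XML parser stands in for NekoHTML, so the input has to be well-formed XHTML; with real-world pages you would get the Document from the HTML parser instead. XPathExtractor and the class="question" markup are assumptions made for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only.
class XPathExtractor {
    /** Returns the trimmed text content of every node matched by the expression. */
    static List<String> select(String xhtml, String expression) {
        try {
            // Stand-in for the HTML parser: this only accepts well-formed markup.
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes =
                    (NodeList) xpath.evaluate(expression, doc, XPathConstants.NODESET);

            List<String> result = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                result.add(nodes.item(i).getTextContent().trim());
            }
            return result;
        } catch (ParserConfigurationException | SAXException
                | IOException | XPathExpressionException e) {
            throw new IllegalArgumentException("Could not evaluate " + expression, e);
        }
    }
}
```

For instance, select(page, "//div[@class='question']/a") would return the link texts of the question blocks and skip links in the navbar.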

Alexandre Victoor
A: 

If your "web sources" are regular websites using HTML (as opposed to a structured XML format like RSS), I would suggest taking a look at HTMLUnit.

This library, while targeted at testing, is really a general-purpose "Java browser". It is built on Apache HttpClient, the NekoHTML parser, and Rhino for JavaScript support. It provides a really nice API to the web page and makes it easy to traverse a website.

Maxim
A: 

Have you considered taking advantage of RSS/Atom feeds? Why scrape the content when it's usually available for you in a consumable format? There are libraries available for consuming RSS in just about any language you can think of, and it'll be a lot less dependent on the markup of the page than attempting to scrape the content.
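
To show how little code a feed needs compared with scraping, here is a sketch that pulls the item titles out of an RSS 2.0 document with the JDK's DOM parser. RssTitles is a made-up name, and a dedicated feed library would also handle Atom, dates and encodings for you.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper for illustration only.
class RssTitles {
    /** Returns the title of every item element in an RSS 2.0 document. */
    static List<String> itemTitles(String rssXml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(rssXml)));

            List<String> titles = new ArrayList<>();
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                // The channel's own <title> sits outside <item>, so it is skipped.
                NodeList title = item.getElementsByTagName("title");
                if (title.getLength() > 0) {
                    titles.add(title.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Not a well-formed feed", e);
        }
    }
}
```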

If you absolutely MUST scrape content, look for microformats in the markup; most blogs (especially WordPress-based blogs) have them by default. There are also libraries and parsers available for locating and extracting microformats from web pages.

Finally, aggregation services/applications such as Yahoo Pipes may be able to do this work for you without reinventing the wheel.

Eric DeLabar
+1  A: 

I agree with Maxim; HTMLUnit is very useful for this kind of work.

PaulF