I'm planning to write a simple J2SE application to aggregate information from multiple web sources.
The most difficult part, I think, is extraction of meaningful information from web pages, if it isn't available as RSS or Atom feeds. For example, I might want to extract a list of questions from stackoverflow, but I absolutely don't need that huge tag cloud or navbar.
What technique/library would you advice?
Updates/Remarks
- Speed doesn't matter — as long as it can parse about 5MB of HTML in less than 10 minutes.
- It sould be really simple.