Most web pages nowadays contain lists of things, or chunks of HTML that repeat the same pattern many times.

For example:

  1. Facebook status messages on homepages.
  2. Digg/Hacker News
  3. StackOverflow homepage

Is there a Java library for detecting such lists? I expect it would involve some amount of pattern matching and intelligence. Thanks.

A: 

Between XPath expressions and HTML element "id" attributes, you should be able to find the root of each list you are interested in, and then more XPath will let you iterate over its items.

If you don't have XPath already, I recommend using HtmlUnit. Yeah, it's meant for testing, but it works really well as a "headless" browser and has excellent support for XPath-ing your way around the DOM of a page.
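
To make the two steps concrete (find the list root by id, then iterate with a relative XPath), here is a minimal sketch. It uses the XPath API that ships with the JDK rather than HtmlUnit, so it only works on well-formed markup; for real pages you would most likely get the DOM from HtmlUnit or another parser first. The class name `ListExtractor`, the method `listItems`, and the `stories` id are made up for illustration, and it assumes the entries are direct `li` children of the root.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of any library.
public class ListExtractor {

    // Returns the trimmed text of each <li> directly under the element
    // whose id attribute equals rootId. Assumes xhtml is well-formed XML
    // and that rootId contains no quote characters.
    public static List<String> listItems(String xhtml, String rootId)
            throws XPathExpressionException {
        XPath xpath = XPathFactory.newInstance().newXPath();

        // Step 1: locate the root of the list by its id attribute.
        Node root = (Node) xpath.evaluate(
                "//*[@id='" + rootId + "']",
                new InputSource(new StringReader(xhtml)),
                XPathConstants.NODE);

        List<String> items = new ArrayList<>();
        if (root == null) {
            return items; // no element with that id
        }

        // Step 2: a relative XPath iterates over the repeated entries.
        NodeList nodes = (NodeList) xpath.evaluate(
                "./li", root, XPathConstants.NODESET);
        for (int i = 0; i < nodes.getLength(); i++) {
            items.add(nodes.item(i).getTextContent().trim());
        }
        return items;
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><ul id=\"stories\">"
                + "<li>First story</li><li>Second story</li>"
                + "</ul></body></html>";
        for (String item : listItems(page, "stories")) {
            System.out.println(item);
        }
    }
}
```

Note that this only extracts a list whose root you already know; it does not discover repeated patterns on its own.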

Rodney Gitzel