I'm looking for a list of heuristics that, given an HTML document and/or the set of URLs on a web page, will pick out the URLs that are previous/next links from that page. Assume the base URL of the page is also given. I don't need to know whether a link is specifically a next or a previous URL, just that it is one of the two.

I've got a short list going already:

  • Same domain and path as the base URL, but different query parameters.
    • base: abc.com/story
    • next/previous: abc.com/story?p=2
      • or
    • base: abc.com/story.html?p=5
    • next/previous: abc.com/story.html?p=3
  • URL is the same as the base URL except for a numeric path element.
    • base: abc.com/story
    • next/previous: abc.com/story/2
  • Several links near each other in the DOM/HTML.
    • I know this could also match a header or footer, so I would have to account for that somehow. Any ideas?
  • Links whose text is a number or whose text is a word like "Next", "Previous", "First", "Last", "Back", "Forward", etc.
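
To make the list concrete, here is a rough sketch of how the URL-based and link-text heuristics might look in Python, using only the standard library. The function names, the `PAGER_WORDS` list, and the three-word cutoff for link text are placeholders I made up for illustration, not an established approach. It also assumes scheme-less URLs like the ones in my examples, and it does not cover the DOM-proximity heuristic.

```python
import re
from urllib.parse import urlsplit, parse_qsl

# Hypothetical word list; extend as needed.
PAGER_WORDS = {"next", "previous", "prev", "first", "last", "back", "forward"}

def _split(url):
    """Return (host, path, query pairs); tolerates URLs without a scheme."""
    if url.startswith("//"):
        url = "http:" + url
    elif "://" not in url:
        # urlsplit only recognizes the host when a scheme or "//" is present
        url = "http://" + url
    parts = urlsplit(url)
    return parts.netloc.lower(), parts.path.rstrip("/"), parse_qsl(parts.query)

def differs_only_in_query(base, candidate):
    """Same domain and path, but different query parameters."""
    b_host, b_path, b_query = _split(base)
    c_host, c_path, c_query = _split(candidate)
    return (b_host == c_host and b_path == c_path
            and sorted(b_query) != sorted(c_query))

def differs_only_in_numeric_segment(base, candidate):
    """Same URL except for one numeric path element (added, dropped, or changed)."""
    b_host, b_path, _ = _split(base)
    c_host, c_path, _ = _split(candidate)
    if b_host != c_host:
        return False
    b_segs = [s for s in b_path.split("/") if s]
    c_segs = [s for s in c_path.split("/") if s]
    # /story -> /story/2
    if len(c_segs) == len(b_segs) + 1 and c_segs[:-1] == b_segs:
        return c_segs[-1].isdigit()
    # /story/2 -> /story
    if len(b_segs) == len(c_segs) + 1 and b_segs[:-1] == c_segs:
        return b_segs[-1].isdigit()
    # /story/2 -> /story/3
    if len(b_segs) == len(c_segs):
        diffs = [(b, c) for b, c in zip(b_segs, c_segs) if b != c]
        return len(diffs) == 1 and diffs[0][0].isdigit() and diffs[0][1].isdigit()
    return False

def text_looks_like_pager(text):
    """Short link text that is a number or contains a pager word."""
    words = re.sub(r"[^0-9a-z ]", " ", text.lower()).split()
    if not words or len(words) > 3:
        return False
    return any(w.isdigit() or w in PAGER_WORDS for w in words)

def pager_score(base, href, text):
    """Count how many heuristics fire; higher means more likely a pager link."""
    return sum([
        differs_only_in_query(base, href),
        differs_only_in_numeric_segment(base, href),
        text_looks_like_pager(text),
    ])
```

For example, `pager_score("abc.com/story", "abc.com/story?p=2", "Next")` counts two hits (query difference and link text). Summing the heuristics instead of requiring any single one is just one way to trade coverage against false positives; the threshold would need tuning on real pages.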

I know I can never be perfect at this, but I would like as much coverage and as many heuristics as I can get, to aim for a good mix of quantity and quality. Thanks.