Given a URL, the URL of the webpage it appears on, the DOM of that webpage, and a list of the other URLs on the page, how can I reliably determine whether the URL is in the header of the page, in the footer, or in neither?

I'm using C#/.NET.

I know that no solution is perfect, since webpages are not marked up semantically and some sites deliberately obfuscate their pages, but I would like to build some logic that works for, say, 75% of webpages.

Also, are there other pieces of information that would be helpful to determine the location of the URL in the page?

A: 

I think the creative task here is to define "header" and "footer", as in "content less than x units away from the top", or "the last 200 characters on the page". Once you have accomplished this, you can parse the page based on those rules.
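As one way to make such rules concrete, here is a minimal C# sketch that combines two simple heuristics: first look at the link's ancestor elements for a `<header>`/`<footer>` tag or a suggestive id/class, then fall back to the link's relative position among all links on the page. Everything in it is an assumption for illustration: the `Ancestor` record, the hint word lists, and the 10% `edgeFraction` cutoff are hypothetical and would need tuning against real pages.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical minimal view of a DOM element: tag name plus id and class attributes.
public record Ancestor(string Tag, string Id = "", string Class = "");

public enum LinkRegion { Header, Footer, Neither }

public static class LinkRegionHeuristic
{
    // Assumed hint words; substring matching is crude (e.g. "header" also matches "subheader").
    static readonly string[] HeaderHints = { "header", "masthead", "topbar", "navbar" };
    static readonly string[] FooterHints = { "footer", "copyright", "site-info" };

    static bool Matches(Ancestor a, string semanticTag, string[] hints)
    {
        if (string.Equals(a.Tag, semanticTag, StringComparison.OrdinalIgnoreCase))
            return true;
        string idAndClass = (a.Id + " " + a.Class).ToLowerInvariant();
        return hints.Any(h => idAndClass.Contains(h));
    }

    // ancestors: chain from the link's parent up to <body>, nearest first.
    // linkIndex / linkCount: zero-based position of the link among all links, in document order.
    // edgeFraction: assumed cutoff for "near the top" / "near the bottom".
    public static LinkRegion Classify(
        IReadOnlyList<Ancestor> ancestors, int linkIndex, int linkCount,
        double edgeFraction = 0.1)
    {
        // Rule 1: a container that looks like a header or footer wins.
        foreach (var a in ancestors)
        {
            if (Matches(a, "header", HeaderHints)) return LinkRegion.Header;
            if (Matches(a, "footer", FooterHints)) return LinkRegion.Footer;
        }

        // Rule 2: fall back to position among the page's links.
        if (linkCount <= 0) return LinkRegion.Neither;
        double position = (double)linkIndex / linkCount;
        if (position < edgeFraction) return LinkRegion.Header;
        if (position >= 1.0 - edgeFraction) return LinkRegion.Footer;
        return LinkRegion.Neither;
    }
}
```

The ancestor chain could be built with whatever HTML parser you already use by walking from the link element up through its parents. Note that this sketch treats any `<header>` or `<footer>` ancestor as page-level, although HTML5 also allows those tags inside an `<article>` or `<section>`, so a stricter version might only accept containers that sit close to `<body>`.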

cdonner
Yeah, that's exactly what the question is asking for: heuristics (one of the question's tags) to label a URL as being in the header or footer. I know I need to define these very broad ideas. I'm looking for everything from the simple (e.g. one of the first x links on a page) to the very complex (backtracking in the DOM looking for containers that look like headers and footers). I would like to emphasize simple heuristics, as I'm aiming for 75% of sites. This 75% is what I consider well-behaved pages. I'm not going to spend 90% of my time on the other 25% of pages. Thanks.
Chad
Furthermore, I want "header" and "footer" to be what you typically consider a header and footer on a webpage. It tends to be obvious when you look at a page, but it is not immediately apparent when you look only at the page's HTML. This is part of the challenge of the question: I want to identify heuristics that can tag a URL as being in the header/footer. **I don't want to constrain the idea of a header/footer; rather, I want to adapt to each page as best as possible**.
Chad