Hi all,

I was wondering whether there are any resources that discuss processing HTML document structure. For example, given any page from the New York Times, I would like to work out where the main article is and which elements on the page are important. For some websites, the raw HTML gives some indication of this. For other sites, it generally contains only formatting tags (fonts, etc.). I have looked at OCR technologies, but most of those are used to recognize individual elements, which is a slightly different problem.

If anyone has any insights regarding this topic, it would be greatly appreciated!

+1  A: 

What you are looking for is called 'screen scraping' or 'data scraping'; a Google search will turn up plenty of results for this. Here's a link from Wikipedia: Web Scraping

You could build something on top of an HTML parser such as Hpricot (a Ruby library).
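To make "build something on top of an HTML parser" a bit more concrete, here is a rough sketch. It uses Python's standard-library `html.parser` rather than Hpricot, and the heuristic (pick the container holding the most paragraph text) is only an assumption for illustration; real news pages will likely need something smarter. The names `ArticleFinder`, `find_main_block` and `CONTAINERS` are made up for this example.

```python
from html.parser import HTMLParser

# Tags treated as candidate "blocks" of a page (an assumption, adjust as needed).
CONTAINERS = {"body", "div", "article", "section", "td"}


class ArticleFinder(HTMLParser):
    """Counts how much <p> text sits directly inside each container."""

    def __init__(self):
        super().__init__()
        self.blocks = []   # one [tag, attrs, text_chars] record per container seen
        self.stack = []    # indices into self.blocks for containers currently open
        self.in_p = False  # True while inside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag in CONTAINERS:
            self.in_p = False  # a new container implicitly ends an unclosed <p>
            self.blocks.append([tag, dict(attrs), 0])
            self.stack.append(len(self.blocks) - 1)
        elif tag == "p":
            self.in_p = True

    def handle_endtag(self, tag):
        if tag in CONTAINERS:
            self.in_p = False
            # Pop back to the nearest open container with this tag, if any.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.blocks[self.stack[i]][0] == tag:
                    del self.stack[i:]
                    break
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        # Credit paragraph text to the innermost open container only.
        if self.in_p and self.stack:
            self.blocks[self.stack[-1]][2] += len(data.strip())


def find_main_block(html):
    """Return (tag, attrs, text_chars) of the container with the most <p> text."""
    finder = ArticleFinder()
    finder.feed(html)
    finder.close()
    if not finder.blocks:
        return None
    tag, attrs, score = max(finder.blocks, key=lambda block: block[2])
    return tag, attrs, score


if __name__ == "__main__":
    sample = (
        '<html><body>'
        '<div id="nav"><a href="/">Home</a><p>Menu</p></div>'
        '<div id="story"><p>The first paragraph of the article body.</p>'
        '<p>A second paragraph with more text.</p></div>'
        '</body></html>'
    )
    print(find_main_block(sample))
```

On pages that use only formatting tags and no paragraphs, this will find nothing useful; counting all text per container, or penalising blocks that are mostly links, would be the obvious next things to try.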

cloudhead