Hi all,
I was just wondering whether there are any resources that discuss processing HTML document structure. For example, given any page from the New York Times, I would like to work out where the main article is and which elements on the page are important. For some websites, the raw HTML gives some indication that helps with this type of processing. For other sites, it generally contains only formatting tags (fonts, etc.). I have looked at OCR technologies, but most of those are used to recognize individual elements, which is a different problem from the one I am describing.
If anyone has any insights on this topic, I would greatly appreciate them!