processing html document structure | ansaurus

Q:

processing html document structure

Hi all,

I was wondering whether there are any resources that discuss processing HTML document structure. For example, given any page from the New York Times, I would like to work out where the main article is and which elements on the page are important. For some websites, the raw HTML gives some indication of this structure. For other sites, generally all it gives is formatting tags (fonts, etc.). I have looked at OCR technologies, but most of those are used to recognize individual elements, which is a slightly different problem.

If anyone has any insights regarding this topic, it would be greatly appreciated!

A:

What you are looking for is called 'screen scraping' or 'data scraping'; a Google search will turn up plenty of results. Here's a link from Wikipedia: Web Scraping

You could build something on top of an HTML parser like Hpricot.
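To give a rough idea of what "building on top of a parser" could look like, here is a minimal sketch. It is only an illustration under several assumptions: it uses Python's standard-library `html.parser` rather than Hpricot (which is a Ruby library), and the heuristic (pick the container holding the most non-link text) is just one simple guess at where the main article sits, not an established algorithm. The names `MainBlockFinder`, `find_main_block`, and the sample markup are all made up for the example.

```python
from html.parser import HTMLParser

# Elements treated as candidate content blocks (an assumption; tune per site).
CONTAINERS = {"div", "article", "section", "main", "td"}
# Elements whose text should never count as article prose.
IGNORED = {"script", "style"}


class MainBlockFinder(HTMLParser):
    """Credits visible, non-link text to the innermost open container."""

    def __init__(self):
        super().__init__()
        self.stack = []        # open containers as (tag, index into scores)
        self.scores = []       # [label, character count] per container seen
        self.ignore_depth = 0  # > 0 while inside <script> or <style>
        self.link_depth = 0    # > 0 while inside <a>

    def handle_starttag(self, tag, attrs):
        if tag in IGNORED:
            self.ignore_depth += 1
        elif tag == "a":
            self.link_depth += 1
        elif tag in CONTAINERS:
            attrs = dict(attrs)
            label = attrs.get("id") or attrs.get("class") or tag
            self.scores.append([label, 0])
            self.stack.append((tag, len(self.scores) - 1))

    def handle_endtag(self, tag):
        if tag in IGNORED and self.ignore_depth:
            self.ignore_depth -= 1
        elif tag == "a" and self.link_depth:
            self.link_depth -= 1
        elif tag in CONTAINERS:
            # Pop back to the nearest matching container; tolerates sloppy markup.
            for i in range(len(self.stack) - 1, -1, -1):
                if self.stack[i][0] == tag:
                    del self.stack[i:]
                    break

    def handle_data(self, data):
        # Link text is skipped so navigation menus do not outscore the article.
        if self.ignore_depth or self.link_depth or not self.stack:
            return
        self.scores[self.stack[-1][1]][1] += len(data.strip())


def find_main_block(html):
    """Return the id/class/tag label of the highest-scoring container, or None."""
    finder = MainBlockFinder()
    finder.feed(html)
    finder.close()
    if not finder.scores:
        return None
    label, _ = max(finder.scores, key=lambda entry: entry[1])
    return label


sample = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/world">World</a></div>
<div id="story"><p>First paragraph of the article body, long enough to dominate.</p>
<p>Second paragraph with more prose.</p></div>
<div id="footer">Copyright</div>
</body></html>
"""
print(find_main_block(sample))  # prints: story
```

On real pages this will often be wrong (comment threads, sidebars, and table-based layouts can all outscore the article), so treat it as a starting point for experimenting with scoring rules rather than a finished extractor.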

cloudhead 2009-07-06 17:55:36
