views: 65

answers: 2

I'm writing a special crawler-like application that needs to retrieve the main content of various pages. Just to clarify: I need the real "meat" of the page (provided there is one, naturally).

I have tried various approaches:

  1. Many pages have RSS feeds, so I can read the feed and get the page-specific content.
  2. Many pages use "content" meta tags.
  3. In a lot of cases, the object presented in the middle of the screen is the main "content" of the page.
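As an illustration of the second approach, here's a minimal sketch that pulls "content" meta tags using Python's standard `html.parser` (the choice of library and the sample page are my assumptions, not from the question):

```python
# Sketch: collect <meta name=... content=...> pairs from a page
# using only the Python standard library.
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collects name/content pairs from <meta> tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            name = d.get("name") or d.get("property")
            if name and "content" in d:
                self.meta[name.lower()] = d["content"]

# Hypothetical sample page, for demonstration only.
page = '<html><head><meta name="description" content="Main content summary"></head></html>'
parser = MetaExtractor()
parser.feed(page)
print(parser.meta.get("description"))  # -> Main content summary
```

In practice you would feed it the downloaded page body and fall back to the other approaches when no useful meta tag is present.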

However, these methods don't always work, and I've noticed that Facebook does a mighty fine job of doing just this (when you want to attach a link, it shows you the content it has found on the linked page).

So - do you have any tips for me on an approach I've overlooked?

Thanks!

A: 

Well, your question is still a little vague. In most cases, a "crawler" is just going to find data on the web in text form and process it for storage, parsing, etc. The "Facebook screenshot" thing is a different beast entirely.

If you're just looking for a web-based crawler, there are several libraries that make it easy to traverse the DOM of a web page and grab the content you're looking for.

If you're using Python, try Beautiful Soup. If you're using Ruby, try Hpricot.

If you want the entire contents of a webpage for processing at a later date, simply fetch and store everything underneath the "html" tag.

Here's an Hpricot example to get all the links off a page:

require 'hpricot'
require 'open-uri'
doc = Hpricot(open("http://www.stackoverflow.com"))
(doc/"a").each do |link|
  puts link.attributes['href']
end
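For the Python side, a rough stdlib-only equivalent of the same link-grabbing idea (using `html.parser` rather than Beautiful Soup, so nothing needs to be installed; the sample HTML is made up):

```python
# Sketch: collect every href from <a> tags, mirroring the Hpricot
# example above, with only the Python standard library.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Hypothetical snippet standing in for a downloaded page.
html = '<a href="/questions">Questions</a> <a href="/tags">Tags</a>'
c = LinkCollector()
c.feed(html)
print(c.links)  # -> ['/questions', '/tags']
```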

Edit: If you're going to be grabbing content primarily from the same sites (e.g. the comments section of Reddit, questions from Stack Overflow, Digg links, etc.), you can hardcode their formats so your crawler can say, "OK, I'm on Reddit, get everything with the class of 'thing'." You can also give it a list of default things to look for, such as divs with a class/id of "main", "content", "center", etc.
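That fallback list of default names can be sketched like this (a toy version in stdlib Python; the hint names come from the answer, the sample markup is invented):

```python
# Sketch: grab the text inside the first <div> whose class/id
# matches one of the default content hints.
from html.parser import HTMLParser

CONTENT_HINTS = ("main", "content", "center")  # default names to try

class ContentFinder(HTMLParser):
    """Records text inside the first div whose class/id matches a hint."""
    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside the matched div (0 = outside)
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1  # track nested divs inside the match
            return
        if tag == "div":
            d = dict(attrs)
            names = (d.get("id", "") + " " + d.get("class", "")).lower()
            if any(hint in names for hint in CONTENT_HINTS):
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data.strip())

# Hypothetical page: a nav div to skip, a content div to keep.
html = '<div id="nav">menu</div><div class="content"><p>The meat.</p></div>'
f = ContentFinder()
f.feed(html)
print(" ".join(t for t in f.text if t))  # -> The meat.
```

A real crawler would score multiple candidate divs (by text length, link density, etc.) instead of taking the first match.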

Mike Trpcic
+1  A: 

There really is no standard way for web pages to mark "this is the meat". Most pages don't even want this, because it makes stealing their core business easier. So you really have to write a framework that can use per-page rules to locate the content you want.
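A per-page rule framework could look roughly like this (a minimal sketch; the hostnames and placeholder rules are illustrative, not real site layouts):

```python
# Sketch: map each hostname to an extraction rule, with a generic
# fallback for sites the crawler has no rule for.
from urllib.parse import urlparse

SITE_RULES = {}

def rule(host):
    """Decorator that registers an extraction rule for a hostname."""
    def wrap(fn):
        SITE_RULES[host] = fn
        return fn
    return wrap

@rule("reddit.com")
def reddit_rule(html):
    # Placeholder: a real rule would parse out class="thing" elements.
    return "everything with class 'thing'"

def extract(url, html):
    host = urlparse(url).netloc.removeprefix("www.")
    handler = SITE_RULES.get(host, lambda h: "fall back to generic heuristics")
    return handler(html)

print(extract("http://www.reddit.com/r/programming", ""))
print(extract("http://example.com/article", ""))
```

New sites then only require registering one more rule, while unknown sites drop through to the generic heuristics.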

Aaron Digulla