views: 65

answers: 2

I'm writing a special crawler-like application that needs to retrieve the main content of various pages. Just to clarify: I need the real "meat" of the page (provided there is one, naturally).

I have tried various approaches:

  1. Many pages have RSS feeds, so I can read the feed and get the page-specific content.
  2. Many pages use "content" meta tags.
  3. In a lot of cases, the object presented in the middle of the screen is the main "content" of the page.
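As an illustration of the second approach, here's a minimal sketch that pulls "content" meta tags using Python's standard `html.parser` (the choice of library and the sample page are my assumptions, not from the question):

```python
# Sketch: collect <meta name=... content=...> pairs from a page
# using only the Python standard library.
from html.parser import HTMLParser

class MetaExtractor(HTMLParser):
    """Collects name/content pairs from <meta> tags."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            name = d.get("name") or d.get("property")
            if name and "content" in d:
                self.meta[name.lower()] = d["content"]

# Hypothetical sample page, for demonstration only.
page = '<html><head><meta name="description" content="Main content summary"></head></html>'
parser = MetaExtractor()
parser.feed(page)
print(parser.meta.get("description"))  # -> Main content summary
```

In practice you would feed it the downloaded page body and fall back to the other approaches when no useful meta tag is present.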

However, these methods don't always work, and I've noticed that Facebook does a mighty fine job of doing just this (when you want to attach a link, it shows you the content it has found on the linked page).

So - do you have any tips for me on an approach I've overlooked?

Thanks!

A: 

Well, your question is still a little vague. In most cases, a "crawler" is just going to find data on the web in text form and process it for storage, parsing, etc. The "Facebook screenshot" thing is a different beast entirely.

If you're just looking for a web-based crawler, there are several libraries that make it easy to traverse the DOM of a web page and grab the content you're looking for.

If you're using Python, try Beautiful Soup. If you're using Ruby, try Hpricot.

If you want the entire contents of a webpage for processing at a later date, simply fetch and store everything underneath the "html" tag.

Here's an Hpricot example to get all the links off a page:

require 'hpricot'
require 'open-uri'
doc = Hpricot(open("http://www.stackoverflow.com"))
(doc/"a").each do |link|
  puts link.attributes['href']
end
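For the Python side, a rough stdlib-only equivalent of the same link-grabbing idea (using `html.parser` rather than Beautiful Soup, so nothing needs to be installed; the sample HTML is made up):

```python
# Sketch: collect every href from <a> tags, mirroring the Hpricot
# example above, with only the Python standard library.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects href values from every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

# Hypothetical snippet standing in for a downloaded page.
html = '<a href="/questions">Questions</a> <a href="/tags">Tags</a>'
c = LinkCollector()
c.feed(html)
print(c.links)  # -> ['/questions', '/tags']
```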

Edit: If you're going to be grabbing content primarily from the same sites (e.g. the comments section of Reddit, questions from Stack Overflow, Digg links, etc.), you can hardcode their formats so your crawler can say, "OK, I'm on Reddit, get everything with the class of 'thing'." You can also give it a list of default things to look for, such as divs with a class/id of "main", "content", "center", etc.
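That fallback list of default names can be sketched like this (a toy version in stdlib Python; the hint names come from the answer, the sample markup is invented):

```python
# Sketch: grab the text inside the first <div> whose class/id
# matches one of the default content hints.
from html.parser import HTMLParser

CONTENT_HINTS = ("main", "content", "center")  # default names to try

class ContentFinder(HTMLParser):
    """Records text inside the first div whose class/id matches a hint."""
    def __init__(self):
        super().__init__()
        self.depth = 0   # nesting depth inside the matched div (0 = outside)
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1  # track nested divs inside the match
            return
        if tag == "div":
            d = dict(attrs)
            names = (d.get("id", "") + " " + d.get("class", "")).lower()
            if any(hint in names for hint in CONTENT_HINTS):
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.text.append(data.strip())

# Hypothetical page: a nav div to skip, a content div to keep.
html = '<div id="nav">menu</div><div class="content"><p>The meat.</p></div>'
f = ContentFinder()
f.feed(html)
print(" ".join(t for t in f.text if t))  # -> The meat.
```

A real crawler would score multiple candidate divs (by text length, link density, etc.) instead of taking the first match.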

Mike Trpcic
+1  A: 

There really is no standard way for web pages to mark "this is the meat". Most pages don't even want this, because it makes stealing their core business easier. So you really have to write a framework that can use per-page rules to locate the content you want.
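A per-page rule framework could look roughly like this (a minimal sketch; the hostnames and placeholder rules are illustrative, not real site layouts):

```python
# Sketch: map each hostname to an extraction rule, with a generic
# fallback for sites the crawler has no rule for.
from urllib.parse import urlparse

SITE_RULES = {}

def rule(host):
    """Decorator that registers an extraction rule for a hostname."""
    def wrap(fn):
        SITE_RULES[host] = fn
        return fn
    return wrap

@rule("reddit.com")
def reddit_rule(html):
    # Placeholder: a real rule would parse out class="thing" elements.
    return "everything with class 'thing'"

def extract(url, html):
    host = urlparse(url).netloc.removeprefix("www.")
    handler = SITE_RULES.get(host, lambda h: "fall back to generic heuristics")
    return handler(html)

print(extract("http://www.reddit.com/r/programming", ""))
print(extract("http://example.com/article", ""))
```

New sites then only require registering one more rule, while unknown sites drop through to the generic heuristics.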

Aaron Digulla