views: 378 · answers: 5

Let's say I want to aggregate information about a specific niche (travel, technology, or whatever) from many sources. How would I do that?

One approach: have a spider/crawler that crawls the web to find the information I need (but how would I tell the crawler what to crawl, since I don't want to fetch the whole web?), then an indexing system that indexes and organizes what was crawled and also acts as a search engine.

Are systems like Nutch (lucene.apache.org/nutch) suitable for what I want? Do you recommend something else?

Or can you recommend another approach?

For example, how is Techmeme.com built? (It's an aggregator of technology news and it's completely automated; only recently did they add some human intervention.) What would it take to build such a service?

Or how does Kayak.com aggregate its data? (It's a travel aggregation service.)

+1  A: 

For a basic look - check out this: http://en.wikipedia.org/wiki/Aggregator

It will give you an overview of aggregators in general.

In terms of building your own aggregator: if you're looking for something out of the box that can get you the content YOU want, I'd suggest this: http://dailyme.com/

If you're looking for a codebase / architecture to BUILD your own aggregator service, I'd suggest looking at something straightforward, like Open Reddit from http://www.reddit.com/

Gabriel
Yes, I would want my own aggregator. Reddit is a Digg-style site, meaning users submit links and vote on them (Pligg and SocialWebCMS are also software that lets you build something like Digg). What I want is more like Techmeme, where the news is gathered automatically and editors can rank it or feature it on the site if necessary.
+1  A: 

That's a long list of questions. :-) Having built one such search engine (for FAQs), I would say that there is no single answer. The design will be very specific to the problem statement.

Chirayu Patel
OK, then here is one question :): how is Techmeme built? (Of course, nobody knows the details except its developers, but at least some ideas...)
+1  A: 

This all depends on the aggregator you are looking for.

Types:

  • Loosely defined - Generally this requires your data source to be very flexible about determining the type of information gathered (it has to answer questions like: is this site/information travel related? Humour? Business related?).
  • Specific - This relaxes the requirements on data storage, since all of the data is known to be specifically travel related: flights, hotel prices, etc.
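To make the storage difference concrete, here is a rough Python sketch; the record layouts and field names are assumptions for illustration, not anything prescribed by this answer:

    from dataclasses import dataclass, field

    # Loosely defined: anything goes, so each item carries free-form text
    # plus whatever topic labels some classification step assigns to it.
    @dataclass
    class GenericItem:
        url: str
        title: str
        body: str
        topics: list = field(default_factory=list)  # e.g. ["travel", "humour"]

    # Specific: the schema itself encodes the domain, so no classification
    # is needed -- every record is known to be, say, a flight offer.
    @dataclass
    class FlightOffer:
        origin: str
        destination: str
        price_usd: float
        airline: str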

Typically an aggregator is a system of sub-programs (a minimal sketch of the three wired together follows the list):

  1. Grabber - this searches for and grabs all of the content that needs to be summarized.
  2. Summarization - this is typically done through queries to the DB and can be adjusted based on user preferences [through programming logic].
  3. View - this formats the information for what the user would like to see and can respond to feedback on the user's likes or dislikes of the items suggested.
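A heavily simplified sketch of those three sub-programs wired together, assuming an in-memory store and keyword matching as the grabbing rule (every name here is illustrative):

    # Minimal grabber -> summarization -> view pipeline.
    db = []

    def grab(documents, keywords):
        """Grabber: keep only content that matches the niche."""
        for doc in documents:
            if any(kw in doc["text"].lower() for kw in keywords):
                db.append(doc)

    def summarize(max_items=5, preferred_source=None):
        """Summarization: query the store, adjusted by user preferences."""
        items = db
        if preferred_source:
            items = [d for d in items if d["source"] == preferred_source]
        return items[:max_items]

    def render(items):
        """View: format the selected items for display."""
        return "\n".join(f'{d["source"]}: {d["text"][:60]}' for d in items)

    grab([{"source": "blog-a", "text": "Cheap flights to Lisbon"},
          {"source": "blog-b", "text": "A recipe for soup"}],
         keywords=["flight", "hotel"])
    print(render(summarize()))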
monksy
A: 

http://www.thefind.com/ is another aggregator. I think another dimension of this question is scraping/aggregating published data from non-member sites vs. pulling in multiple databases from partner sites.

adam
A: 

You need to define what your application is going to do. Building your own web crawler is a huge task: you tend to keep adding new features as you discover you need them, which only complicates your design, etc.

Building an aggregator is quite different. Whereas a crawler simply retrieves data to be processed later, an aggregator takes already-defined sets of data and puts them together. If you use an aggregator, you will probably want to look for already-defined travel feeds, financial feeds, and so on. An aggregator is easier to build IMO, but it's more constrained.
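For instance, a minimal aggregator over already-defined feeds could look like the sketch below; the feed URLs are placeholders, and I'm assuming plain RSS 2.0 layout:

    # Aggregate known RSS sources: no crawling, just fetch and merge.
    import urllib.request
    import xml.etree.ElementTree as ET

    FEEDS = [
        "https://example.com/travel.rss",   # placeholder URLs
        "https://example.org/deals.rss",
    ]

    def fetch_items(url):
        with urllib.request.urlopen(url, timeout=10) as resp:
            tree = ET.parse(resp)
        # RSS 2.0 layout: <rss><channel><item>...</item></channel></rss>
        for item in tree.getroot().iter("item"):
            yield {
                "title": item.findtext("title", default=""),
                "link": item.findtext("link", default=""),
            }

    def aggregate(feeds):
        items = []
        for url in feeds:
            items.extend(fetch_items(url))
        return items

    for entry in aggregate(FEEDS):
        print(entry["title"], "->", entry["link"])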

If, instead, you want to build a crawler, you'll need to define starting pages, define ending conditions (crawl depth, time limits, etc.), and so on, and then still process the data afterwards (that is, aggregate, summarize, and so on).
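A small bounded crawler along those lines, sketched with only the Python standard library (the allowlisted host and depth limit are arbitrary assumptions that keep it from fetching the whole web):

    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse

    ALLOWED_HOSTS = {"example.com"}   # assumption: scope the crawl by domain
    MAX_DEPTH = 2                     # assumption: depth as the ending condition

    class LinkParser(HTMLParser):
        """Collect href values from anchor tags."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_urls):
        seen, pages = set(), {}
        frontier = [(u, 0) for u in start_urls]
        while frontier:
            url, depth = frontier.pop()
            if url in seen or depth > MAX_DEPTH:
                continue
            if urlparse(url).hostname not in ALLOWED_HOSTS:
                continue            # stay inside the niche
            seen.add(url)
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue
            pages[url] = html       # store raw pages for later aggregation
            parser = LinkParser()
            parser.feed(html)
            frontier.extend((urljoin(url, link), depth + 1)
                            for link in parser.links)
        return pages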

Chad