Let's say I want to aggregate information about a specific niche (travel, technology, whatever) from many sources. How would I do that?
Would the approach be to have a spider/crawler that crawls the web for the information I need (and how would I tell the crawler what to crawl, since I don't want to fetch the whole web?), and then an indexing system that indexes and organizes what was crawled and also acts as a search engine?
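To make the intent concrete, here is a rough sketch of the kind of "focused crawler" I have in mind: it only indexes and follows pages that match a set of niche keywords, instead of crawling the whole web. Everything here is made up for illustration (the `PAGES` site map, `KEYWORDS`, the function names), and `fetch()` is stubbed with in-memory data so the sketch runs offline; a real crawler would replace it with an HTTP client and a proper relevance model.

```python
from collections import deque

# Stubbed "web": url -> (page text, outgoing links). Illustrative only.
PAGES = {
    "http://example.com/": ("travel deals and flight search",
                            ["http://example.com/flights",
                             "http://example.com/about"]),
    "http://example.com/flights": ("cheap flights to europe", []),
    "http://example.com/about": ("company history", []),
}

# Niche vocabulary that defines what "relevant" means for this crawl.
KEYWORDS = {"travel", "flight", "flights"}

def fetch(url):
    """Stub for an HTTP GET: returns (text, outlinks)."""
    return PAGES.get(url, ("", []))

def relevant(text, keywords=KEYWORDS):
    """Crude relevance test: does the page mention any niche keyword?"""
    return bool(set(text.lower().split()) & keywords)

def focused_crawl(seeds, max_pages=100):
    """Breadth-first crawl from seed URLs, pruning off-topic pages."""
    seen, queue, index = set(seeds), deque(seeds), {}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        text, links = fetch(url)
        if not relevant(text):
            continue  # prune: don't index or expand off-topic pages
        index[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

index = focused_crawl(["http://example.com/"])
# The root page and /flights are indexed; /about is pruned as off-topic.
```

The pruning step is what keeps the crawl scoped to the niche: off-topic pages are neither stored nor expanded, so the frontier stays small.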
Is a system like Apache Nutch (lucene.apache.org/nutch) suitable for this? Do you recommend something else, or a different approach entirely?
For example, how is Techmeme.com built? (It's an aggregator of technology news, and it's completely automated; only recently have they added some human intervention.) What would it take to build such a service?
Or how does Kayak.com aggregate its data? (It's a travel aggregation service.)