I need to develop a vertical search engine as part of a website. The data for the search engine comes from websites in a specific category. I guess for this I need a crawler that crawls several sites (a few hundred, all in a specific business category) and extracts the content and URLs of products and services; other types of pages may be irrelevant. Most of the sites are tiny or small (a few hundred pages at most). The products have 10 to 30 attributes.

Any ideas on how to write such a crawler and extractor? I have written a few crawlers and content extractors using the usual Ruby libraries, but not a full-fledged search engine. I guess the crawler wakes up from time to time and downloads the pages from the websites, following the usual polite behavior like checking robots exclusion rules, of course, while the content extractor updates the database after it reads the pages. How do I synchronize the crawler and the extractor? How tightly should they be integrated?
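For example, one split I can imagine is to keep them loosely coupled: the crawler pushes fetched pages onto a queue (or a database table, or a directory of files), and the extractor consumes from it on its own schedule. A rough in-process sketch of what I have in mind, assuming the nokogiri gem for HTML parsing; the seed URL, CSS selector and save_product call are just placeholders:

    require 'net/http'
    require 'uri'
    require 'nokogiri'  # HTML parsing; assumes the nokogiri gem is installed

    pages = Queue.new
    seed_urls = ['http://example.com/products.html']  # placeholder seed list

    crawler = Thread.new do
      seed_urls.each do |url|
        # Politeness (robots.txt, crawl delay) and error handling omitted for brevity.
        pages << [url, Net::HTTP.get(URI(url))]
      end
      pages << :done  # sentinel so the extractor knows the crawl is finished
    end

    extractor = Thread.new do
      while (item = pages.pop) != :done
        url, body = item
        doc = Nokogiri::HTML(body)
        name = doc.at_css('h1.product-name')  # selector is site-specific
        puts "#{url}: #{name && name.text.strip}"
        # save_product(url, name, attributes)  # hypothetical database update
      end
    end

    [crawler, extractor].each(&:join)

In a real run the queue would probably be a database table instead, so the crawler and extractor can run as separate processes on different schedules.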

+1  A: 

In the enterprise-search context that I am used to working in,

  • crawlers,

  • content extractors,

  • search engine indexes (and the loading of your content into these indexes),

  • being able to query that data efficiently and with a wide range of search operators,

  • programmatic interfaces to all of these layers,

  • optionally, user-facing GUIs

are all separate topics.

(For example, while extracting useful information from an HTML page vs. a PDF vs. an MS Word file is conceptually similar, the actual programming for these tasks is still very much a work in progress for any general solution.)
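To make that concrete in your language: even for the simplest "give me the plain text" task, you would typically reach for a different Ruby library per format. The sketch below uses the nokogiri and pdf-reader gems; the dispatch-by-extension approach is just illustrative, and MS Word is deliberately left out because there is no equally standard Ruby gem for it.

    require 'nokogiri'    # HTML parsing (nokogiri gem)
    require 'pdf-reader'  # PDF text extraction (pdf-reader gem)

    # Conceptually the same task for every format, but each one needs its own
    # library and has its own quirks.
    def extract_text(path)
      case File.extname(path).downcase
      when '.html', '.htm'
        Nokogiri::HTML(File.read(path)).text
      when '.pdf'
        PDF::Reader.new(path).pages.map(&:text).join("\n")
      else
        raise "No extractor for #{path}"
      end
    end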

You might want to look at the Lucene suite of open-source tools, understand how those fit together, and possibly decide that it would be better to learn how to use those tools (or others like them) than to reinvent a very big, complicated wheel.
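One more note on that suite: you do not necessarily have to write Java to use it. Solr, the search server from the same Apache Lucene project, exposes the index over plain HTTP, so a Ruby front end can query it directly. A rough query sketch, assuming a single-core Solr on localhost with the default /solr/select handler and made-up field names:

    require 'net/http'
    require 'uri'
    require 'json'

    # The host, handler path and field names ('category', 'name') are
    # assumptions -- adjust them to your Solr setup and schema.
    def search_products(query, rows = 10)
      uri = URI('http://localhost:8983/solr/select')
      uri.query = URI.encode_www_form('q' => query, 'rows' => rows.to_s, 'wt' => 'json')
      JSON.parse(Net::HTTP.get(uri))['response']['docs']
    end

    search_products('category:plumbing AND name:valve').each { |doc| puts doc['name'] }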

I believe in books, so thanks to your query, I have discovered this book and have just ordered it. It looks like a good take on one possible solution to the search-tool conundrum.

http://www.amazon.com/Building-Search-Applications-Lucene-LingPipe/product-reviews/0615204252/ref=cm_cr_pr_hist_5?ie=UTF8&showViewpoints=0&filterBy=addFiveStar

Good luck and let us know what you find out and the approach you decide to take.

A: 

Nutch builds on Lucene and already implements a crawler and several document parsers. You can also hook it to Hadoop for scalability.

Mauricio Scheffer
A: 

Hi Ven,

I'm looking for almost the same answer. Did you decide to implement anything? Please let me know your analysis of the options available to you for developing a vertical search engine for a specific niche business. I would appreciate it if you could share your experience with this implementation.

Thanks, Kumar