I am looking for a search engine that can regularly (roughly daily) scan about 100 pages for changes and re-index the associated site if anything has changed since the last scan. It should be able to handle about 100 sites, each averaging 4000 pages of about 5k each, and each hosted on a different server (with only the one centralized search engine). Each of these sites will have a search form that submits to this search engine, and the results returned must be specific to the site the query was submitted from. I create the templates for the external sites, so I can give the search form a hidden field that specifies which site the form is submitted from.
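To make the change-detection requirement concrete, here is a minimal sketch of the kind of check I have in mind, assuming the watched pages have already been fetched as bytes. The function names `page_fingerprint` and `changed_urls` are hypothetical, and comparing SHA-256 hashes is just one simple way to detect changes, not something any particular search engine prescribes.

```python
import hashlib


def page_fingerprint(content: bytes) -> str:
    """Return a stable fingerprint of a page body (hypothetical helper)."""
    return hashlib.sha256(content).hexdigest()


def changed_urls(previous: dict, current_pages: dict) -> list:
    """Return the URLs whose content differs from the last scan.

    previous:      maps URL -> fingerprint stored by the previous scan
    current_pages: maps URL -> freshly fetched page body (bytes)
    A URL that was not seen before counts as changed.
    """
    changed = []
    for url, content in current_pages.items():
        if previous.get(url) != page_fingerprint(content):
            changed.append(url)
    return changed
```

If `changed_urls` returns anything for a site's watched pages, that site would be queued for re-indexing.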

What would you recommend I look into?

I would love to use a Python-based system for this, if feasible.

I am currently using something called iSearch2. It doesn't seem very stable at this scale, and its product description says it is not really intended for multiple sites. It is also written in PHP (which I am less comfortable with than Python) and has a few other shortcomings for my specific situation.

+1  A: 

I recommend PyLucene. PyLucene is a Python extension for accessing Java Lucene; it works very well and is fast.

aeby
+1  A: 

If you're looking for a pure-Python search engine, you could look at Whoosh. The problem with Whoosh is that it's slow and not as full-featured. It would be fine if your site doesn't get much traffic, but you might need something more robust for production.

With that being said, I like using Xapian with its Python bindings. It's pretty fast and easy to set up.

You could also use Solr, which has a Python API. Solr is written in Java, but don't let that fool you: it's the best performer of this bunch. You'll just have to run a Java server to get it working.

Since I use Django, I can integrate Haystack into my projects, which makes it easy to switch search engines. I'll use Whoosh for development because it's easy and fast to set up (it can install in the virtualenv), but deploy with Xapian or Solr for production depending on my needs.
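As a rough sketch of what that switch can look like in a Django settings file, assuming Haystack 2.x-style `HAYSTACK_CONNECTIONS` settings (older releases were configured differently): the helper `haystack_connections`, the Solr URL, and the index path are placeholders for illustration, not values from any real project.

```python
def haystack_connections(production: bool) -> dict:
    """Pick a Haystack backend configuration (hypothetical helper).

    Assumes Haystack 2.x-style settings; the URL and PATH values
    below are placeholders.
    """
    if production:
        # Solr runs as a separate Java server reached over HTTP.
        return {
            "default": {
                "ENGINE": "haystack.backends.solr_backend.SolrEngine",
                "URL": "http://127.0.0.1:8983/solr",
            }
        }
    # Whoosh is pure Python and stores its index in a local directory.
    return {
        "default": {
            "ENGINE": "haystack.backends.whoosh_backend.WhooshEngine",
            "PATH": "whoosh_index",
        }
    }


# In settings.py, the flag would typically come from the environment.
HAYSTACK_CONNECTIONS = haystack_connections(production=False)
```

The rest of the project only talks to Haystack, so swapping the engine is limited to this one setting plus rebuilding the index.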

digitaldreamer
A: 

+1 for Lucene. If PyLucene seems complex, you could alternatively look at Solr, which is a search server based on Lucene with an HTTP interface. It is highly scalable, very fast, and offers a rich feature set out of the box, such as faceted browsing and caching.

Since Solr is HTTP-based, you could hook into it from any language (incl. Python) using its RESTful API.
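For the per-site results in the question, one option is to index every page with a site identifier and restrict each search with a filter query. Below is a minimal sketch that only builds the request URL. It uses Solr's standard `q`, `fq`, `rows`, and `wt` parameters, but the core name `pages`, the field name `site_id`, and the function `build_solr_query_url` are assumptions made for illustration.

```python
from urllib.parse import urlencode


def build_solr_query_url(base_url: str, user_query: str, site_id: str, rows: int = 10) -> str:
    """Build a Solr select URL restricted to one site (hypothetical helper).

    base_url is the core URL, e.g. http://localhost:8983/solr/pages
    site_id is the value of the form's hidden field; it is assumed to be
    a trusted, simple identifier (no quotes or backslashes).
    """
    params = {
        "q": user_query,               # what the visitor typed
        "fq": f'site_id:"{site_id}"',  # filter query: only this site's pages
        "rows": rows,                  # number of results to return
        "wt": "json",                  # ask for a JSON response
    }
    return f"{base_url}/select?{urlencode(params)}"


url = build_solr_query_url("http://localhost:8983/solr/pages", "annual report", "site42")
```

The resulting URL can then be fetched with any HTTP client and the JSON response rendered in the site's own template.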

Mikos