views: 2812
answers: 5

I'm currently looking at search methods other than a huge SQL query. I recently came across ElasticSearch, and I've played with Whoosh (a Python implementation of a search engine).

Can you explain why you chose, or would choose, any of these projects?
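To make the comparison concrete, here is a toy inverted index in Python — purely illustrative, not any engine's actual implementation. It shows the core data structure that all of the engines discussed below (Lucene, Sphinx, Whoosh, ElasticSearch) build on, and why a term lookup beats a huge SQL `LIKE` query: each term maps directly to the set of documents containing it.

```python
from collections import defaultdict

# Toy inverted index: maps each term to the set of document ids
# containing it. Real engines add stemming, ranking, and on-disk
# storage, but the core lookup is the same.
def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    # AND semantics: a document must contain every query term.
    terms = query.lower().split()
    if not terms:
        return set()
    result = index.get(terms[0], set()).copy()
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

docs = {
    1: "full text search with Lucene",
    2: "Sphinx full text search for MySQL",
    3: "Python web frameworks",
}
idx = build_index(docs)
print(search(idx, "full text"))   # {1, 2}
```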

+2  A: 

I don't have a vast amount of experience in this field, but I have worked with both Lucene and Sphinx. Lucene was fine up to a few thousand items being indexed, but beyond that it was unusable: searches were unbearably slow and re-building the index would take hours.

Sphinx performs very well; I have it running on a site at present with 70,000+ articles. Searches complete very quickly, and it can rebuild its entire index in ~11 seconds. I chose Sphinx based on recommendations from other developers and the knowledge that a few big sites rely on it for their search engines (Neowin being one of them).

Steve
How did you manage building your index? Is there a way to update the index when a new object is created or deleted?
dzen
It's possible to do incremental updates with Lucene: http://wiki.apache.org/lucene-java/LuceneFAQ#What_is_index_optimization_and_when_should_I_use_it.3F
dzen
I don't know exactly how it's configured in our environment, as I wasn't the only one involved in setting it up; but as far as I know you can write new data to a delta index which is then merged into the main index. I believe it's still wise to rebuild the index regularly to ensure efficiency. In our setup I think we don't use a delta and just rebuild the index fairly regularly, which does the job pretty well.
Steve
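The main+delta pattern described in the comment above can be sketched as follows. This is a language-neutral illustration of the idea, not Sphinx's actual API: recent writes land in a small delta structure that is searched alongside the main index and periodically folded back in.

```python
class DeltaIndex:
    """Sketch of a main + delta index: cheap incremental writes,
    periodic merge back into the (expensive-to-rebuild) main index."""

    def __init__(self):
        self.main = {}      # doc_id -> text, rebuilt/merged rarely
        self.delta = {}     # doc_id -> text, holds recent writes
        self.deleted = set()

    def add(self, doc_id, text):
        self.delta[doc_id] = text

    def delete(self, doc_id):
        self.deleted.add(doc_id)
        self.delta.pop(doc_id, None)

    def search(self, term):
        # Query both indexes; delta entries shadow main ones.
        merged = {**self.main, **self.delta}
        return [d for d, t in merged.items()
                if d not in self.deleted and term in t]

    def merge(self):
        # The periodic "rebuild": fold delta and deletions into main.
        self.main.update(self.delta)
        for d in self.deleted:
            self.main.pop(d, None)
        self.delta.clear()
        self.deleted.clear()

idx = DeltaIndex()
idx.add(1, "full text search")
idx.merge()
print(idx.search("search"))   # [1]
```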
+6  A: 

We use Lucene regularly to index and search tens of millions of documents. Searches are quick enough, and we use incremental updates that do not take a long time. It did take us some time to get here. The strong points of Lucene are its scalability, a large range of features and an active community of developers. Using bare Lucene requires programming in Java.

If you are starting afresh, the tool for you in the Lucene family is Solr, which is much easier to set up than bare Lucene and has almost all of Lucene's power. It can import database documents easily. Solr is written in Java, so any modification of Solr requires Java knowledge, but you can do a lot just by tweaking configuration files.

I have also heard good things about Sphinx, especially in conjunction with a MySQL database. Have not used it, though.

IMO, you should choose according to:

  • The required functionality - e.g., do you need a French stemmer? Lucene and Solr have one; I do not know about the others.
  • Proficiency in the implementation language - do not touch Java Lucene if you do not know Java. You may need C++ to do stuff with Sphinx. Lucene has also been ported to other languages. This is mostly important if you want to extend the search engine.
  • Ease of experimentation - I believe Solr is best in this aspect.
  • Interfacing with other software - Sphinx has a good interface with MySQL. Solr supports Ruby, XML and JSON interfaces as a RESTful server. Lucene only gives you programmatic access through Java. Compass and Hibernate Search are wrappers around Lucene that integrate it into larger frameworks.
Yuval F
You raised an important point: a search engine must be adaptable.
dzen
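On the interfacing point above: Solr's RESTful interface means a query is just an HTTP GET with standard parameters. The sketch below only builds the URL (it does not contact a server); the host, port, and core name "articles" are assumptions for illustration, while the /select handler and the q/wt/rows parameters are standard Solr.

```python
from urllib.parse import urlencode

# Build a query URL for Solr's RESTful interface. Any language that
# can issue an HTTP GET and parse JSON can use this — no Java needed.
params = {
    "q": "title:lucene AND body:search",
    "wt": "json",   # ask for a JSON response instead of the XML default
    "rows": 10,     # number of results to return
}
url = "http://localhost:8983/solr/articles/select?" + urlencode(params)
print(url)
```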
+2  A: 

I have used both Sphinx and Lucene/Solr.

If you just want a simple full-text search setup, Sphinx is the better choice.

If you want to customize your search at all, lucene/solr is the better choice. It's very extensible: you can write your own plugins to adjust result scoring.
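In Lucene/Solr the scoring plugins mentioned above are written in Java; the following is only a language-neutral sketch of the idea. The field names and the recency weight are invented for illustration: re-rank hits by blending the engine's text-match score with a domain signal.

```python
# Sketch of the kind of scoring adjustment a custom plugin allows:
# combine the engine's text relevance with a recency boost.
# "text_score" and "recency" are hypothetical, normalized to [0, 1].
def rerank(hits, recency_weight=0.3):
    def score(hit):
        return ((1 - recency_weight) * hit["text_score"]
                + recency_weight * hit["recency"])
    return sorted(hits, key=score, reverse=True)

hits = [
    {"id": 1, "text_score": 0.9, "recency": 0.1},
    {"id": 2, "text_score": 0.7, "recency": 0.9},
]
# The fresher document 2 overtakes the better text match:
print([h["id"] for h in rerank(hits)])   # [2, 1]
```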

Solr is built on top of Lucene. It adds much commonly needed functionality: a web server API, faceting, caching, etc.

Some example usages: Sphinx: craigslist.org; Lucene/Solr: LinkedIn, CNET, Netflix, Digg.

tommy chheng
+2  A: 

Lucene is nice and all, but their stop word set is awful. I had to manually add a ton of stop words to StopAnalyzer.ENGLISH_STOP_WORDS_SET just to get it anywhere near usable.
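The fix described above is done in Java against Lucene's `StopAnalyzer.ENGLISH_STOP_WORDS_SET`; here is the same idea sketched in Python. The base set below is a tiny invented sample, not Lucene's actual list: start from the stock stop words and union in your own.

```python
# A tiny sample standing in for the engine's built-in stop words.
BASE_STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in"}

# Extend with domain-specific noise words (illustrative choices).
CUSTOM_STOP_WORDS = BASE_STOP_WORDS | {"http", "www", "com"}

def remove_stop_words(text):
    """Tokenize naively and drop anything in the stop set."""
    return [t for t in text.lower().split()
            if t not in CUSTOM_STOP_WORDS]

print(remove_stop_words("The http link in the www archive"))
# ['link', 'archive']
```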

I haven't used Sphinx but I know people swear by its speed and near-magical "ease of setup to awesomeness" ratio.

larley
+12  A: 

As the creator of ElasticSearch, maybe I can give you some reasoning on why I went ahead and created it in the first place :).

Using pure Lucene is challenging. There are many things you need to take care of if you want it to really perform well. Also, it's a library, so there is no distributed support; it's just an embedded Java library that you need to maintain.

In terms of Lucene usability, way back when (almost 6 years now), I created Compass. Its aim was to simplify the everyday use of Lucene. What I came across time and time again was the requirement to have Compass distributed. I started to work on it from within Compass, by integrating with data grid solutions like GigaSpaces, Coherence and Terracotta, but it's not enough.

At its core, a distributed Lucene solution needs to be sharded. Also, with the rise of HTTP and JSON as ubiquitous APIs, it means a solution that many different systems in different languages can easily use.
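The sharding requirement can be sketched in a few lines. This illustrates the general routing idea only, not ElasticSearch's internal algorithm: a stable hash of the document id picks the shard, so any node can compute where a document lives without a lookup table.

```python
import hashlib

# Fixed shard count for illustration; in real sharded systems the
# shard count for an index is also fixed up front, because changing
# it would re-route every existing document.
NUM_SHARDS = 4

def shard_for(doc_id):
    """Route a document to a shard via a stable hash of its id."""
    digest = hashlib.md5(str(doc_id).encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same id always routes to the same shard, on any node:
assert shard_for("doc-42") == shard_for("doc-42")
print({d: shard_for(d) for d in ["a", "b", "c"]})
```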

This is why I went ahead and created ElasticSearch. It has a very advanced distributed model, speaks JSON natively, and exposes many advanced search features, all seamlessly expressed through a JSON DSL.
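As a taste of that JSON DSL, here is a query body built in Python. The "match" query shown is part of ElasticSearch's real DSL (in its current form); the index name and field are illustrative. The snippet only builds the body — sending it would be a plain HTTP POST to an endpoint like http://localhost:9200/articles/_search.

```python
import json

# An ElasticSearch query expressed in its JSON DSL: a full-text
# "match" on a field, limited to 10 hits.
query = {
    "query": {
        "match": {"title": "distributed search"}
    },
    "size": 10,
}
body = json.dumps(query)
print(body)
```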

Solr is also a solution for exposing an indexing/search server over HTTP, but I would argue that ElasticSearch provides a much superior distributed model and greater ease of use (though it currently lacks some of the search features — but not for long, and in any case the plan is to get all of Compass's features into ElasticSearch). Of course, I am biased, since I created ElasticSearch, so you might need to check for yourself.

As for Sphinx, I have not used it, so I can't comment. What I can do is refer you to this thread on the Sphinx forum, which I think demonstrates the superior distributed model of ElasticSearch.

Of course, ElasticSearch has many more features than just being distributed. It is actually built with the cloud in mind. You can check the feature list on the site.

kimchy