The story so far:

Decided to go with Xapian as search backend because it has all search-engine features I was looking for, knows about Unicode, stemming, has few dependencies and requires no bloated app-server installation on top of it.

Tried Django and Haystack (plus xapian-haystack, the backend glue code to tie Haystack to Xapian) because it was advertised on quite a few blogs as "working". Did not work. Neither the django-haystack nor the xapian-haystack project provides a version combination that actually works together. MASTER from both projects yields an error from Xapian, so it's not stable at all. Haystack 1.0.1 and xapian-haystack 1.0.x/1.1.0 are not API-compatible. Plus, in a minimally working installation of Haystack 1.0.1 and xapian-haystack MASTER, any complex query yields zero results due to errors in either django-haystack or xapian-haystack (I double-verified this), maybe because the unit tests only cover very simple cases and no edge cases at all.

Tried Djapian. The source code is riddled with spelling errors (mind you, in variable names, not comments), and the documentation is riddled with ambiguities and outdated information that will never lead to a working installation. Not surprisingly, users rarely ask for features; they ask how to get it working in the first place.

Next on the plate: exploring Solr (installing a Java environment plus Tomcat gives me headaches, the machine is RAM- and CPU-constrained), or Lucene (slightly fewer headaches, but still).

Before I proceed spending more time with a solution that might or might not work as advertised, I'd like to know: Did anyone ever get an actual, real-world search solution working in Django? I'm serious. I find it really frustrating reading about "large problems mostly solved", and then realizing that you will never get a working installation from the source-code because, actually, all bloggers dealing with those "mostly solved problems" never went past basic installation and copy-pasting the official tutorials.

So here are the requirements:

  • must be able to search for 10-100 terms in one query
  • must handle + (term must be present) and - (term must not be present), AND/OR
  • must handle arbitrary grouping (i.e. parentheses around AND/OR)
  • must allow for Django-ORM filtering before or after fulltext-search (i.e. pre-/post-processing of results with the full set of filters that Django knows about)
  • alternatively, there must be a facility to bulk-fetch the result set and transform it into a QuerySet
  • should be light on the machine, so preferably no humongous JVM and Java-based app-server installation
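
For illustration, the bulk-fetch requirement could look roughly like the minimal sketch below. It assumes the search engine returns primary keys in relevance order; the `Article` model and the `published` field in the comments are hypothetical, and `order_by_rank` is a helper name invented for this example. Stand-in data is used so the sketch runs without Django.

```python
def order_by_rank(ranked_pks, objects_by_pk):
    """Return objects in search-rank order, skipping hits with no matching row."""
    return [objects_by_pk[pk] for pk in ranked_pks if pk in objects_by_pk]


# Hypothetical usage inside Django (Article is an assumed model):
#   ranked_pks = [42, 7, 19]   # primary keys returned by the search engine
#   rows = Article.objects.filter(published=True).in_bulk(ranked_pks)
#   results = order_by_rank(ranked_pks, rows)

# Stand-in data so the sketch runs without Django.
ranked_pks = [42, 7, 19]
rows = {7: "article 7", 42: "article 42"}  # pk 19 was excluded by the ORM filter
results = order_by_rank(ranked_pks, rows)
print(results)
```

The ORM filter runs in the database, and the ranking from the search engine is restored in Python afterwards, because a `pk__in`-style lookup does not preserve the order of the keys it was given.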

Is there anything out there that does this? I'm not interested in anecdotal evidence, or references to some blog posts that claim it should be working. I'd like to hear from someone who actually has a fully-functional setup working in the real world, under real conditions, with real queries.

EDIT:

Let me repeat that I'm not so much interested in anecdotal evidence that someone, somewhere has a somewhat running installation working with unspecified properties. I already went there: I read all the blog posts and mailing lists, and I contacted the authors, but when it came to actual implementation of real-world scenarios, nothing ever worked as advertised.

Also, and a user below brought that point up as well, considering the TCO of any project, I'm definitely not interested in hearing that someone, somewhere was able to pull it off once a vendor parachuted in an unknown number of specialists to monkey-patch the whole installation with specific domain-knowledge that's documented nowhere.

So, please, if you claim you have a working installation that actually satisfies minimum requirements for a full-fledged search (see requirements above), please provide the following so that we can all benefit from a search solution for Django that actually solves the problem:

  • exact Linux distribution, release version,
  • exact release version of Haystack (or equivalent) and release version of search backend,
  • exact release version of the search engine
  • publicly (!) available documentation how to set up all components exactly in the way that your installation was set up such that the minimal requirements above are met.

Thank you.

+5  A: 

Short answer: No.

We bailed and went with a Google Custom Search. Although the site has over 10,000 possible page views, we keep the sitemap feed down to the main 4,000 pages or so and it costs $250/year, which is about 2 hours of my time. The customer is happy and he feels comfortable with the results.

I'd love to see someone come up with a good FOSS solution, but in a commercial situation the TCO has got to make economic sense.

Peter Rowell
+2  A: 

I (and my colleagues) have successfully used Haystack to achieve a fairly good search functionality.

It is easy to start with Haystack and the Whoosh backend, and to switch to the Apache Solr backend when the performance of Whoosh is no longer acceptable.

We really have to get around to writing a detailed post about it, with links to the projects where it works.

For now, I suggest you have a look at this search: http://www.webdevjobshq.com/search/?q=rails implemented using Haystack with the Apache Solr backend. Or this: http://www.govbuddy.com/search/?q=Roy

Lakshman Prasad
I'd very much be interested to know how you got it working, and which exact versions you are using, including the OS versions (e.g. Solr on Ubuntu 9.10 is a pain because there is no package for Tomcat 5.5 anymore, hence the solr-tomcat5.5 package can't be installed, which means pulling and compiling a lot of dependencies).
prometheus
Also, how many search terms does Whoosh handle? And is Whoosh in actual development? Judging from the project's Trac site, it's stalled and fatal errors still exist.
prometheus
The Trac site is out of date. Whoosh has moved to Bitbucket: http://bitbucket.org/mchaput/whoosh/changesets/, and is still under active development.
Chris Lawlor
A: 

I use Djapian. It was quite simple to install and works great. There is an actual tutorial that covers basic use cases and shows the entire integration process.

Yes, it has some ambiguities, but the issue tracker is open and the authors rapidly fix bugs and add features.

Alex Koshelev
I have followed the tutorial, it doesn't work. Please tell me the exact versions that you used that led to a fully-working installation, and that actually work for real-world scenarios.
prometheus
+1  A: 

Have you considered Sphinx? What are you using as your data store? It has a MySQL engine that works terrifically. I think it meets most of your requirements, except that I'm not exactly certain how nicely it can be tied into the Django ORM.

I'm seriously considering using Sphinx in one of my own Django apps to improve performance on an auto-suggest field that does a prefix and infix search on a corpus of 3.5 million records. But I haven't got around to implementing it yet, so I can't speak to Django+Sphinx integration. My only Sphinx experience is with the MySQL engine and querying MySQL directly.

nategood
And that's precisely the problem. I have stumbled upon Sphinx in the past, but I have never met anyone or read about how to actually integrate Sphinx into Django in a real-world scenario, including specific version numbers that are actually compatible. Keywords from the author of django-sphinx in his own words: "After installation you need to edit a few settings in settings.py, which, again, being that I suck at documentation, isn’t posted on the website." I'm not going to touch this one, sorry.
prometheus
I've got django-sphinx working fine. I'm using the latest versions of both, and infix and prefix searching work well. They create massive indexes, but otherwise, they work. Yes, the documentation isn't great from django-sphinx, but it's enough. The beauty of it is that it's actually a rather small connector, and you can figure out what's going on if it doesn't work like you expect. Sphinx is quite powerful and fast, and the support on the forum is good. And the settings that he says aren't posted...I believe they are posted on the project site.
mlissner
+4  A: 

I have developed some Django applications with xapian support too. The biggest of them has a xapian database with an index of 8G storing 2.4M documents (including forum posts, wiki entries, planet entries and blog entries) - still growing.

Overall I am quite happy with xapian. It performs extremely well and is easy to use. The only thing I don't like is that xapian won't work with mod_wsgi (except in global mode) because of a deadlock. So you are forced to use fastcgi (or connect to xapian-tcpsrv, or write your own service).

I recommend using the xapian-bindings directly. Xapian nowadays offers quite a lot of useful helpers (TermGenerator, QueryParser etc.), which make both indexing and querying simple. In fact, there is nothing I can imagine that would justify an additional library. In my opinion they are all more complicated and don't allow you to index efficiently.

The only thing you need is some understanding of how xapian works. (What are terms? What are values? What is stemming and where should I use it? And so on.) You can find all those topics on the xapian website, and as soon as you understand those concepts, dealing with xapian becomes easy.

Also, the xapian API is extremely stable. I started using it long before the 1.0 release and never had any problems with API changes or version conflicts. The only thing that has changed is that all those helpers (query parser, tokenizer, etc.) I once wrote for my Django project are now useless, because similar classes have made their way into the xapian core.

So, to summarize, just give the direct usage of xapian-bindings a try.

tux21b
+1  A: 

I can vouch for Django-Haystack with the Xapian backend (In the interest of full disclosure, I am the author of the xapian-haystack backend) in a real life, production environment. We currently use Haystack/Xapian on several sites, the largest of which has more than 20,000 registered users and a Xapian database with 20,000+ documents containing more than 143,000 unique terms for a total size of ~141mb.

As for not being able to get any combination of Haystack and the Xapian backend running, I'll admit that I was not as diligent as I should have been with my tagging, so there is some confusion about the versions. You should, however, be able to use the current master of both codebases without any issue. If this is not the case, I'd be more than happy to assist with problems. You'll need to be a little more specific about the issue, though. Simply saying "it did not work" is not enough information.

Daniel and I both do our best to respond to any issues opened on GitHub in a timely manner. Also, we're both usually available on the #haystack IRC channel during the day and on the django-haystack Google Group.

Versions used:

  • Haystack 1.0BETA with Xapian-Haystack 1.1.0BETA
  • Haystack 1.0.1FINAL with Xapian-Haystack 1.1.3BETA

Most of the sites we've deployed with Haystack have been running Ubuntu 8.04 LTS with Xapian 1.0.5.

notanumber
+1  A: 

The details you requested.

  • exact Linux distribution, release version - Ubuntu 9.04 & 9.10
  • exact release version of Haystack (or equivalent) - Haystack 1.0 as well as master
  • release version of search backend - The Solr & Whoosh backends included with Haystack
  • exact release version of the search engine - Solr 1.3, Solr 1.4 & Whoosh 0.3.15
  • publicly (!) available documentation how to set up all components exactly in the way that your installation was set up such that the minimal requirements above are met.

Beyond this, it's the standard configuration bits from the tutorial, plus any additional overrides from (which I can't link to, thanks Stack Overflow) as needed.

As the maintainer of Haystack, I'm actively running all of the setups above. The smallest Haystack installation (Haystack 1.0 + Whoosh) is ~600 documents. A slightly larger one (Haystack master + Solr 1.4) is ~4000 documents. The largest deployment I'm aware of (Haystack master + Solr 1.4) is ~3 million documents.

I generally try to avoid Stack Overflow, so don't be surprised if you see nothing further from me. The mailing list is the best place for support, but given your responses thus far, I'm sure you'd rather just trash me here.

toastdriven
If you'd label "asking precise questions to verify claims made by developers" as "trashing", well, fine. The point I'm trying to make is that if you as the developer of Haystack only have an installation of 600 to 4000 indexed documents, you shouldn't make any claims that your software is fit for the real world, because you haven't verified it yet. Also, I already tried the above versions: Solr on Ubuntu 9.x is a no-go because there is no Tomcat 5 package, master of Haystack/xapian-haystack is not stable, etc. So no-go for me.
prometheus