Hi everyone. This is more of a theory question than a practical one. I'm working on a project which is quite a simple catalog of links. The whole model is similar to the Dmoz or Yahoo catalog, except that each entry has certain additional attributes.

I have a hierarchical taxonomy working across all entries through a many-to-many relationship. All entries are now sorted into these categories, and everything seems to work fine. Now, what use is a catalog if there's no search option?

Here's a little bit more detail about my models: Each entry has a title, description, URL and several social profiles: YouTube, Twitter, Flickr and a couple of others. Each entry could have a logo attached to it, and a hidden field for tags. Also, the title and description are stored in three different languages. So basically I'd like the search results to be:

  1. Relevant (including taxonomy)
  2. Possibly ones with logos
  3. Possibly ones with 100% filled out profiles

I've tried Sphinx and am currently working with Lucene, but it seems that I'm not getting the search right in theory. I hope it makes sense that filled-in entries should appear higher than the others, but I can't really figure out the scoring. I wouldn't want irrelevant entries to appear on top just because a single word matches somewhere in the description, since titles are more relevant.

So my question is: are there any books, techniques or even other search engines (if Sphinx and Lucene are not good enough) that you would recommend for this? Not only would I like to get full control over search results and their ranking, but I'd also like to give my visitors correct and relevant information.

Links to cool articles are appreciated too!

And No, I'm not trying to rebuild Google :)

Thanks :)

+3  A: 

I'm pretty sure that Lucene is enough. We solved a similar task with it and it worked well. Here are some hints I can offer, looking back at my Lucene.Net project.

Taxonomy:

  • Each category is represented by an integer key in the DB, so each document has multiple instances of a numeric 'CATEGORY' field. For example, document:[1,2,5,10, 'Wheel'] means that Wheel belongs to each of the categories 1, 2, 5 and 10.
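
As a rough illustration of the multi-valued CATEGORY idea, here is a minimal sketch in plain Python rather than Lucene.Net. The `Entry` class and the `in_category` helper are hypothetical names made up for this example, not part of any library.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """A catalog entry carrying every category ID it belongs to."""
    entry_id: int
    title: str
    category_ids: frozenset = frozenset()


def in_category(entries, category_id):
    """Return the entries tagged with the given category ID."""
    return [e for e in entries if category_id in e.category_ids]


# 'Wheel' is filed under categories 1, 2, 5 and 10, as in the example above.
entries = [
    Entry(1, "Wheel", frozenset({1, 2, 5, 10})),
    Entry(2, "Tire", frozenset({2, 7})),
]

matches = in_category(entries, 2)
```

In a real index the same effect would come from adding the category field once per ID and filtering on it at query time.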

Non-searchable fields (logos, social profile):

  • Of course you can keep non-searchable values in Lucene's stored, non-indexed fields. But we kept all product-related information in the DB to avoid rebuilding Lucene's index whenever it changes. So Lucene holds only the product ID and the indexed values of the key fields.

Three languages and multiple fields:

  • We have only 2 languages. The different titles of a product can be stored in the same Lucene document and relate to a single product ID (as I wrote before, the ID refers to the DB). This lets you find a product even if the user's request mixes languages.
  • Obviously title, tags and description carry different weight in the search result. Lucene handles this by assigning a weight (boost) to each field.
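
To make the field-weight idea concrete, here is a minimal sketch in plain Python (not the Lucene API). The weights 10, 5 and 1, the `FIELD_WEIGHTS` table and the `score` function are assumptions chosen for illustration. Each field is assumed to hold the text of all languages side by side.

```python
# Hypothetical per-field weights: a title hit counts far more than a
# description hit.
FIELD_WEIGHTS = {"title": 10.0, "tags": 5.0, "description": 1.0}


def score(doc, query):
    """Sum weighted term matches over all fields of a document."""
    terms = query.lower().split()
    total = 0.0
    for field_name, weight in FIELD_WEIGHTS.items():
        tokens = doc.get(field_name, "").lower().split()
        matches = sum(tokens.count(term) for term in terms)
        total += weight * matches
    return total


title_hit = {"title": "Wheel Rad", "description": "Parts for bicycles"}
body_hit = {"title": "Car shop", "description": "We also sell a wheel"}

ranked = sorted([body_hit, title_hit], key=lambda d: score(d, "wheel"), reverse=True)
```

With these weights a single match in the title outranks a single match in the description, which is the behaviour the question asks for.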
Dewfy
+3  A: 

Excellent book: Lucene in Action (2nd edition)

When we started with Lucene we had the first edition, it really takes you through everything you need step by step. Highly recommended. The 2nd edition is updated for the latest and greatest version (3.x.x).

The TF-IDF algorithm works very well on (larger) texts, but if you have a record-like structure it may backfire: because scores are normalized by field length, documents with few terms are considered more "relevant" than ones with many terms. With Lucene you can get it to work, but you'll have to get your hands dirty.
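
The length effect can be seen in a small sketch. This is a simplified formula in the spirit of Lucene's classic TF-IDF similarity (square-root term frequency, logarithmic IDF, and a 1/sqrt(length) norm), not the exact scoring Lucene performs. The function name and the sample documents are made up for illustration.

```python
import math


def tf_idf_score(term, doc_tokens, corpus):
    """Simplified TF-IDF score with length normalization."""
    freq = doc_tokens.count(term)
    if freq == 0:
        return 0.0
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    tf = math.sqrt(freq)
    idf = 1.0 + math.log(len(corpus) / (doc_freq + 1))
    # Shorter fields get a larger norm, so sparse records score higher.
    length_norm = 1.0 / math.sqrt(len(doc_tokens))
    return tf * idf * length_norm


sparse = "wheel shop".split()
rich = "wheel shop with tires rims spokes and a repair service".split()
corpus = [sparse, rich, "unrelated entry".split()]

sparse_score = tf_idf_score("wheel", sparse, corpus)
rich_score = tf_idf_score("wheel", rich, corpus)
```

Both documents mention the term once, yet the two-word record scores higher than the fully described one, which is the opposite of what a catalog wants.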

What you'll basically have to do is boost your title field, so it becomes more relevant. You may also change the scoring mechanism to assign higher scores for documents that have more information.
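
One way to picture "higher scores for documents that have more information" is a multiplicative completeness factor applied on top of the text score. The sketch below is plain Python, not Lucene's scoring API. The field list, the 0.5 weight and both function names are assumptions for illustration.

```python
# Hypothetical list of optional profile fields on a catalog entry.
PROFILE_FIELDS = ("logo", "youtube", "twitter", "flickr")


def completeness(entry):
    """Fraction of optional profile fields that are filled in (0.0 to 1.0)."""
    filled = sum(1 for name in PROFILE_FIELDS if entry.get(name))
    return filled / len(PROFILE_FIELDS)


def final_score(text_score, entry, weight=0.5):
    """Scale the text relevance score by up to (1 + weight)."""
    return text_score * (1.0 + weight * completeness(entry))


full = {"logo": "a.png", "youtube": "u", "twitter": "t", "flickr": "f"}
bare = {}

boosted = final_score(2.0, full)
unboosted = final_score(2.0, bare)
```

Because the factor multiplies the text score, an entry that does not match the query at all stays at zero no matter how complete its profile is.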

Have fun. If you can't figure it out, there is excellent support on the Lucene mailing list.

Matthijs Bierman
+1  A: 

Lucene or Solr would do the job. Solr is built on top of Lucene, see here for more info

I would go with Solr. Downloading it and setting it up is easy and fast. Get started with the tutorial and my link collection. Relevancy should be fine with Solr and is easily tunable.

See Dewfy's and Matthijs Bierman's answers for some good points.

Then choose the dismax query handler, which lets you prefer docs with certain properties.

E.g. for the percentage of a full profile you define a separate field 'profile_completeness'. Then you can add profile_completeness to the bf (boost function) parameter of the dismax handler: the more complete the profile is, the more those docs will be boosted.
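
A minimal sketch of what such a request could look like, built with Python's standard library only. It assumes the schema has `title`, `tags`, `description` and a numeric `profile_completeness` field, that per-field weights go in dismax's `qf` parameter and the completeness function in `bf`. The weights and the `build_dismax_params` helper are hypothetical.

```python
from urllib.parse import parse_qs, urlencode


def build_dismax_params(user_query):
    """Assemble dismax parameters for a catalog search (illustrative values)."""
    return {
        "defType": "dismax",
        "q": user_query,
        # Assumed field weights: title matters most, then tags.
        "qf": "title^10 tags^5 description",
        # Boost function: the stored completeness value raises the score.
        "bf": "profile_completeness",
    }


query_string = urlencode(build_dismax_params("bicycle wheel"))
decoded = parse_qs(query_string)
```

The resulting `query_string` would be appended to the select URL of your own Solr instance. The exact URL depends on your setup and is left out here.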

I mentioned before that you can easily tune the relevancy: e.g. you can weight the searched fields through qf (query fields) and keep the function boost in bf, with something like: qf=title^10 tags^5 and bf=profile_completeness

"Possibly ones with logos" can be solved via boost queries: bq=logo:[* TO *]^1, where logo:[* TO *] matches only docs that have a value in the logo field.

To display a deeply nested category tree you will need to create that tree in memory and feed Solr with a special import. We have a working app for that. You can use our approach.
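
The app mentioned above is not shown here, so the following is only a generic sketch of building category paths in memory before indexing. It assumes categories are available as a child-to-parent mapping. The `ancestor_path` name and the sample IDs are made up for illustration.

```python
def ancestor_path(category_id, parent_of):
    """Return the IDs from the root category down to category_id."""
    path = []
    seen = set()  # guards against accidental cycles in the data
    current = category_id
    while current is not None and current not in seen:
        seen.add(current)
        path.append(current)
        current = parent_of.get(current)
    path.reverse()
    return path


# Hypothetical taxonomy: 1 is the root, 10 is nested three levels deep.
parent_of = {1: None, 2: 1, 5: 2, 10: 5}

path_to_leaf = ancestor_path(10, parent_of)
```

Indexing the whole path with each entry lets a search restricted to a parent category also find entries filed under its descendants.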

If you need further assistance don't hesitate to comment.

Karussell