I haven't used Lucene. The last time I asked (many months ago, maybe a year), people suggested Lucene. If I shouldn't use Lucene, what should I use? As an example, say there are items tagged like this:

  1. apples carrots
  2. apples
  3. carrots
  4. apple banana

If a user searches for apples, I don't care how 1, 2 and 4 are ordered relative to each other. However, something I have seen on many forums, and HATED, is that when a user searches for apple carrots, 2 and 3 rank high while 1 is hard to find, even though it matches the search more closely.

I would also like to be able to search carrots -apples, which should return only 3. I am not sure what should happen if I search carrots banana, but as long as items that match only one tag (like 2 and 3) rank lower than 1 when I search apples carrots, I'll be happy.

Can Lucene do this, and where do I start? When I tried looking it up, I saw a lot of classes and tutorials talking about documents and web pages, but none were clear about what to do when I want to tag something. If not Lucene, what should I use for tagging?

+4  A: 

Edit: You can use Lucene. Here's an explanation of how to do this in Lucene.net. Some Lucene basics are:

  • Document - is the storage unit in Lucene. It is somewhat analogous to a database record.
  • Field - the search unit in Lucene. Analogous to a database column. Lucene searches for text by taking a query and matching it against fields. A field should be indexed in order to enable search.
  • Token - the search atom in Lucene. Usually a word, sometimes a phrase, letter or digit.
  • Analyzer - the part of Lucene that transforms a field into tokens.

Please read this blog post about creating and using a Lucene.net index.

I assume you are tagging blog posts. If I am totally wrong, please say so. In order to search for tags, you need to represent them as Lucene entities, namely as tokens inside a "tags" field.

One way of doing so is to assign one Lucene document to each blog post. The document will have at least the following fields:

  • id: unique id of the blog post.
  • content: the text of the blog post.
  • tags: list of tags.

Indexing: Whenever you add a tag to a post, remove a tag or edit it, you will need to re-index the post. The Analyzer will transform the fields into their token representation.

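// One Lucene document per blog post; writer is an already opened IndexWriter.
Document doc = new Document();
// id is stored so it can be read back from a hit, but it is not indexed (not searchable).
doc.Add(new Field("id", i.ToString(), Field.Store.YES, Field.Index.NO));
doc.Add(new Field("content", text, Field.Store.YES, Field.Index.TOKENIZED));
// tags is a whitespace-separated string; the Analyzer splits it into one token per tag.
doc.Add(new Field("tags", tags, Field.Store.YES, Field.Index.TOKENIZED));
writer.AddDocument(doc);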
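Since a post has to be indexed again whenever its tags change, the old copy needs to be replaced rather than added a second time. The following is only a sketch of one way to do that, assuming Lucene.Net 2.1 or later (where `IndexWriter.UpdateDocument` is available). `PostIndexer` and `ReindexPost` are hypothetical names, and note the assumption that "id" is indexed as an untokenized term here, because a field created with `Field.Index.NO` cannot be used to locate the old document.

```csharp
using Lucene.Net.Documents;
using Lucene.Net.Index;

public static class PostIndexer
{
    // Hypothetical helper: replaces the indexed copy of a post after its tags change.
    public static void ReindexPost(IndexWriter writer, int postId, string text, string tags)
    {
        Document doc = new Document();
        // Assumption: "id" is indexed as one untokenized term so it can be matched exactly.
        doc.Add(new Field("id", postId.ToString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
        doc.Add(new Field("content", text, Field.Store.YES, Field.Index.TOKENIZED));
        doc.Add(new Field("tags", tags, Field.Store.YES, Field.Index.TOKENIZED));

        // Deletes every document whose "id" term equals postId, then adds the new one.
        writer.UpdateDocument(new Term("id", postId.ToString()), doc);
    }
}
```

If the post has never been indexed, the delete step simply matches nothing and the call behaves like a plain add.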

The remaining part is retrieval. For this, you need to create a QueryParser and pass it a query string, like this:

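// "tags" is the default field; analyzer should be the same Analyzer used for indexing.
QueryParser qp = new QueryParser("tags", analyzer);
Query q = qp.Parse(s);
// searcher is an IndexSearcher opened on the index.
Hits hits = searcher.Search(q);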

The syntax you need for s will be:

tags:apples tags:carrots

to search for apples or carrots, and

tags:carrots NOT tags:apples

to search for carrots but not apples.

See the Lucene Query Parser Syntax for details on constructing s.
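To tie indexing and retrieval together, here is a small end-to-end sketch using the four tagged items from the question. It is written against the Lucene.Net 2.x API and makes several assumptions that are not in the answer itself: an in-memory `RAMDirectory`, a `WhitespaceAnalyzer` so that each tag becomes exactly one token, and the hypothetical class name `TagSearchDemo`.

```csharp
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

public static class TagSearchDemo
{
    public static void Main()
    {
        // Assumption: in-memory index, whitespace analyzer (one token per tag, no stemming).
        RAMDirectory dir = new RAMDirectory();
        Analyzer analyzer = new WhitespaceAnalyzer();

        // The four tagged items from the question, numbered 1 to 4.
        string[] tagLists = { "apples carrots", "apples", "carrots", "apple banana" };
        IndexWriter writer = new IndexWriter(dir, analyzer, true);
        for (int i = 0; i < tagLists.Length; i++)
        {
            Document doc = new Document();
            doc.Add(new Field("id", (i + 1).ToString(), Field.Store.YES, Field.Index.NO));
            doc.Add(new Field("tags", tagLists[i], Field.Store.YES, Field.Index.TOKENIZED));
            writer.AddDocument(doc);
        }
        writer.Close();

        IndexSearcher searcher = new IndexSearcher(dir);
        QueryParser parser = new QueryParser("tags", analyzer);
        string[] queries = { "tags:apples tags:carrots", "tags:carrots NOT tags:apples" };
        foreach (string s in queries)
        {
            // Hits are returned in descending score order.
            Hits hits = searcher.Search(parser.Parse(s));
            Console.WriteLine(s);
            for (int i = 0; i < hits.Length(); i++)
                Console.WriteLine("  item " + hits.Doc(i).Get("id") + "  score " + hits.Score(i));
        }
        searcher.Close();
    }
}
```

With the query parser's default OR operator, the first query should match items 1, 2 and 3, with item 1 expected on top because it matches both terms. The second query should return only item 3. Item 4 is not matched by apples here because the whitespace analyzer does no stemming; a stemming analyzer or a wildcard such as apple* would be needed for that.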

Yuval F
Great answer. Too bad I overslept and didn't get to SO until the bounty was over. Adding search doesn't seem as bad as I originally thought.
acidzombie24
nice answer - the right way to implement 'tags' is an important question - because there are so many (wrong? painfully slow?) ways one could do it, and the idea of tags/folksonomies is here to stay (in favor of hierarchical taxonomies, that is)
Bobby
+2  A: 

Lucene for .NET seems to be mature. No need to use Java or Solr.

The standard query language for Lucene allows equally ranked search terms and negation.

So if your Lucene index had a field "tag", your query would be

tag:apple* OR tag:carrot*

This gives equal ranking to each word, and more weight to documents with both tags.

To negate a tag, use this:

tag:carrot* NOT tag:apple*
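When the tags come from user input, it may be simpler to build the same kind of query programmatically instead of assembling a query string. The sketch below assumes the Lucene.Net 2.x API and a field named "tag" as in this answer; `TagQueries` and `Build` are hypothetical names. It uses exact `TermQuery` matches rather than the wildcards shown above, so each tag must equal the indexed token exactly.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Search;

public static class TagQueries
{
    // Hypothetical helper: wanted tags are optional (OR), excluded tags are prohibited (NOT).
    public static Query Build(string[] wanted, string[] excluded)
    {
        BooleanQuery query = new BooleanQuery();
        foreach (string tag in wanted)
        {
            // SHOULD: a document may match any of these; matching more of them scores higher.
            query.Add(new TermQuery(new Term("tag", tag)), BooleanClause.Occur.SHOULD);
        }
        foreach (string tag in excluded)
        {
            // MUST_NOT: a document containing this tag is dropped from the results.
            query.Add(new TermQuery(new Term("tag", tag)), BooleanClause.Occur.MUST_NOT);
        }
        return query;
    }
}
```

For example, `TagQueries.Build(new string[] { "carrots" }, new string[] { "apples" })` corresponds to carrots NOT apples. A query built from excluded tags only matches nothing, so at least one wanted tag is needed.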

Simple example to show indexing and querying with Lucene here

TFD
Thanks :). I hope more ppl keep this coming (i really need help!)
acidzombie24
this tutorial looks good and the query link looks useful. I suspect i'll be messing with this before the end of the day.
acidzombie24