ansaurus

Question

How to find related items by tags in Lucene.NET

Answer 1

+2 A:

Your pipe-delimited set of ids should really have been separated into individual fields when the documents were indexed. This way, you could simply do a query for the desired tag, sorting by relevance descending.

Adam Paynter 2009-05-11 14:01:51

That would not be very practical as the list may contain 0 to n Ids

Arnold Zokas 2009-05-11 14:02:45

@ArnieZ: Please bear with me. What would not be practical?

Adam Paynter 2009-05-11 14:03:59

@Adam Paynter: What if there are 20 tags? Would I then have to search against 20 fields?

Arnold Zokas 2009-05-11 14:05:13

@ArnieZ: I suspect you should be able to construct a BooleanQuery with 20 TermQuery clauses. It would still be one query.

Adam Paynter 2009-05-11 14:07:03

@Adam Paynter: The issue then is that some documents may contain 1 tag, some may contain 20 tags. I would have a search that tries to match against 19 tags that don't exist.

Arnold Zokas 2009-05-11 14:10:18

@ArnieZ: When constructing a BooleanQuery, you can specify whether each clause (the TermQuery in this case) MUST, MUST NOT or SHOULD match. If you use SHOULD when building your BooleanQuery, Lucene will still find the document, even if only one of its 20 tags matches. It will just score lower than a document that matches 2 of its 20 tags, and so on.

Adam Paynter 2009-05-11 14:13:35
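
The SHOULD behaviour described in the comment above can be modelled without Lucene. The following is a minimal sketch in plain Python, not Lucene.NET code: `rank_by_shared_tags` and the sample document ids are hypothetical names, and the score is a simple match count, whereas real Lucene scoring also weighs factors such as term rarity.

```python
def rank_by_shared_tags(query_tags, documents):
    """Rank documents by how many of the query tags they contain.

    Plain-Python stand-in for a BooleanQuery built from SHOULD clauses:
    a document qualifies if it matches at least one tag, and documents
    with more matching tags rank higher. This only counts matches; it
    does not reproduce Lucene's actual scoring formula.
    """
    wanted = set(query_tags)
    scored = []
    for doc_id, tags in documents.items():
        matches = len(wanted & set(tags))
        if matches > 0:  # SHOULD: one matching clause is enough
            scored.append((doc_id, matches))
    # Highest match count first; ties broken by id for stable output
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Hypothetical sample data: document id -> list of tag ids
documents = {
    "doc-a": ["t1", "t2", "t3"],
    "doc-b": ["t1"],
    "doc-c": ["t9"],
}
ranking = rank_by_shared_tags(["t1", "t2"], documents)
print(ranking)
```

Here a document sharing one tag is still returned, only ranked below a document sharing two, and a document sharing none is left out.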

@Adam Paynter: I will investigate this. Thanks.

Arnold Zokas 2009-05-11 14:15:30

@ArnieZ: I noticed a comment on Mike's answer. I just wanted to clarify: I meant that the pipe-delimited ids would be separated into multiple fields, all having the same field name.

Adam Paynter 2009-05-11 14:21:02

@Adam Paynter: Ahh, I see. I will try this now.

Arnold Zokas 2009-05-11 14:24:49

@ArnieZ: :) The BooleanQuery is constructed using one BooleanClause per tag IN THE DOCUMENT THAT THE USER IS VIEWING.

Adam Paynter 2009-05-11 14:26:31

@Adam Paynter: Query parser came up with the following query: "Tags:a845497737704e8ab439dd410e7f1328 Tags:0a2d7192f75148cca89b6df58fcf2e54 Tags:204fce58c936434598f7bd7eccf11771"

Arnold Zokas 2009-05-11 14:42:52

@Adam Paynter: The only result I get is the current document itself. Odd...

Arnold Zokas 2009-05-11 14:43:30

@ArnieZ: Try building a query such as "Tags:a845497737704e8ab439dd410e7f1328 OR Tags:0a2d7192f75148cca89b6df58fcf2e54 OR Tags:204fce58c936434598f7bd7eccf11771"

Adam Paynter 2009-05-11 14:47:27

@Adam Paynter: It looks like, in Lucene.NET, my query syntax is equivalent to yours. I will experiment with fewer tags...

Arnold Zokas 2009-05-11 14:56:42

@Adam Paynter: Sadly, this approach does not work for me. I am going to try the Link Database approach. Thanks for the help.

Arnold Zokas 2009-05-11 15:08:17

@ArnieZ: Did you separate the ids into multiple fields? Were the fields specified as being indexed?

Adam Paynter 2009-05-11 15:12:33

@Adam Paynter: Yes and yes.

Arnold Zokas 2009-05-11 15:16:09

Answer 2

+2 A:

You can have the same field multiple times in a document. In this case, you would add multiple "tag" fields at index time by splitting on |. Then, when you search, you just have to search on the "tag" field.

Mike 2009-05-11 14:07:39
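
To illustrate the index-time split described in the answer above, here is a rough sketch in plain Python rather than Lucene.NET. It models the effect of adding one "tag" value per id: each tag maps to the set of documents that carry it. The names `split_tags` and `build_tag_index` and the sample ids are assumptions made for the example.

```python
from collections import defaultdict


def split_tags(raw):
    """Split a pipe-delimited tag string into individual tag values."""
    return [tag for tag in raw.split("|") if tag]


def build_tag_index(raw_tags_by_doc):
    """Map each tag value to the set of documents carrying it.

    Every tag becomes its own entry under one logical "tag" field, which
    models adding the same field name several times to a single document.
    """
    index = defaultdict(set)
    for doc_id, raw in raw_tags_by_doc.items():
        for tag in split_tags(raw):
            index[tag].add(doc_id)
    return index


# Hypothetical sample data: document id -> pipe-delimited tag ids
index = build_tag_index({"doc-a": "t1|t2", "doc-b": "t2", "doc-c": ""})
print(sorted(index["t2"]))
```

A document with an empty tag string simply contributes no entries, so a list of 0 to n ids needs no special handling.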

I didn't know this was possible. Do you know of any SDK page that describes this?

Arnold Zokas 2009-05-11 14:13:01

I usually use the Java version's documentation, which says "Several fields may be added with the same name": http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/document/Document.html#add(org.apache.lucene.document.Fieldable)

Mike 2009-05-11 14:16:19

@Mike: Thanks, I'll look into this.

Arnold Zokas 2009-05-11 14:23:30

Answer 3

+1 A:

Setting aside for a minute the possible uses of Lucene for this task (which I am not overly familiar with) - consider checking out the LinkDatabase.

Sitecore will, behind the scenes, track all your references to and from items. And since your multiple tags are indeed (I assume) selected from a meta hierarchy of tags represented as Sitecore Items somewhere - the LinkDatabase would be able to tell you all items referencing it.

In some sort of pseudo code mockup, this would then become

for each ID in tags
  get all documents referencing this tag
  for each document found
    if master-list contains document: increase its usage-count
    else: add document to master-list with usage-count 1
sort master-list by usage-count descending

Forgive me for not being more precise, but I am unable to mock up a fully working example right now.

You can find an article about the LinkDatabase here http://larsnielsen.blogspirit.com/tag/XSLT. Be aware that if you're tagging documents using a TreeListEx field, there is a known flaw in earlier versions of Sitecore. Documented here: http://www.cassidy.dk/blog/sitecore/2008/12/treelistex-not-registering-links-in.html

Mark Cassidy 2009-05-11 14:57:43
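
The counting loop in the pseudo-code above can be sketched in plain Python. This is not Sitecore's API: `referrers_of` is a hypothetical callback standing in for whatever LinkDatabase lookup returns the items referencing a tag, and the optional `current_id` parameter is an addition that leaves the item being viewed out of the results.

```python
from collections import Counter


def related_by_links(tag_ids, referrers_of, current_id=None):
    """Count how many of the given tags each referencing document shares.

    referrers_of(tag_id) is assumed to return the ids of all documents
    that reference that tag. Results are sorted by usage count, highest
    first, with ties broken by id.
    """
    counts = Counter()
    for tag_id in tag_ids:
        for doc_id in referrers_of(tag_id):
            if doc_id != current_id:  # skip the document being viewed
                counts[doc_id] += 1
    return sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))


# Hypothetical link data: tag id -> documents referencing it
links = {"t1": ["doc-a", "doc-b", "doc-c"], "t2": ["doc-a", "doc-c"]}
related = related_by_links(
    ["t1", "t2"], lambda tag: links.get(tag, []), current_id="doc-a"
)
print(related)
```

Using a counter removes the separate "contains" check from the pseudo-code, since a missing document starts at zero.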

@Mark Cassidy: I have not used the Link Database before, but I am going to try your approach.

Arnold Zokas 2009-05-11 15:08:55

I wrote up a full article with an implementation of this pseudo-code - if for no other reason than just to assert to myself it could be done the way I envisioned ;-) You can find it here: http://www.cassidy.dk/blog/sitecore/2009/05/listing-related-articles-with-sitecore.html

Mark Cassidy 2009-05-14 23:45:01

Answer 4

+1 A:

Try this query on the tag field.

+(tag1 OR tag2 OR ... tagN)

where tag1, .. tagN are the tags of a document.

This query will return documents with at least one matching tag. The scoring will automatically bring the documents with the most matches to the top, as the final score is the sum of the individual clause scores.

Also, you need to realize that if you search for documents similar to the tags of Doc1, Doc1 itself will come out at the top of the search results. So, handle this case accordingly.

Shashikant Kore 2009-05-11 17:08:55
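
As a small sketch of the answer above, the query string can be assembled from a document's tags and the source document dropped from the hits afterwards. This is plain Python string handling, not a Lucene.NET call. The field name `Tags` is taken from the query quoted earlier in the thread; `build_related_query` and `drop_source` are hypothetical helpers, and the sketch assumes tag values contain no query-syntax special characters (true for the hex ids shown in the thread).

```python
def build_related_query(tags, field="Tags"):
    """Build a query string of the form +(Tags:a OR Tags:b ...).

    Assumes tag values need no escaping for the query parser.
    """
    if not tags:
        raise ValueError("at least one tag is required")
    clauses = " OR ".join(f"{field}:{tag}" for tag in tags)
    return f"+({clauses})"


def drop_source(hits, source_id):
    """Remove the document whose tags produced the query from the hits."""
    return [doc_id for doc_id in hits if doc_id != source_id]


query = build_related_query(["a845", "0a2d"])
print(query)
# Hypothetical hit list, as ids in ranked order
print(drop_source(["doc-1", "doc-7", "doc-3"], "doc-1"))
```

Filtering after the search keeps the query itself unchanged; excluding the source document inside the query would need an id field, which the thread does not describe.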
