My indexed documents have a field containing a pipe-delimited set of ids:

a845497737704e8ab439dd410e7f1328|
0a2d7192f75148cca89b6df58fcf2e54|
204fce58c936434598f7bd7eccf11771

(ignore line breaks)

This field represents a list of tags. The list may contain 0 to n tag Ids.

When users of my site view a particular document, I want to display a list of related documents. This list of related documents must be determined by tags:

  • Only documents with at least one matching tag should appear in the "related documents" list.
  • Documents with the most matching tags should appear at the top of the "related documents" list.


I was thinking of using a WildcardQuery for this, but queries starting with '*' are not allowed.


Any suggestions?

+2  A: 

Your pipe-delimited set of ids should really have been separated into individual fields when the documents were indexed. This way, you could simply do a query for the desired tag, sorting by relevance descending.

Adam Paynter
That would not be very practical as the list may contain 0 to n Ids
Arnold Zokas
@ArnieZ: Please bear with me. What would not be practical?
Adam Paynter
@Adam Paynter: What if there are 20 tags? Would I then have to search against 20 fields?
Arnold Zokas
@ArnieZ: I suspect you should be able to construct a BooleanQuery with 20 TermQuery clauses. It would still be one query.
Adam Paynter
@Adam Paynter: The issue then is that some documents may contain 1 tag, some may contain 20 tags. I would have a search that tries to match against 19 tags that don't exist.
Arnold Zokas
@ArnieZ: When constructing a BooleanQuery, you can specify if the clause (the TermQuery in this case) MUST, MUST NOT or SHOULD exist. If you use SHOULD when building your BooleanQuery, Lucene will still find the document, even if only one of its 20 tags matches. It will just score lower than a document that matches 2 of its 20 tags, etc...
Adam Paynter
@Adam Paynter: I will investigate this. Thanks.
Arnold Zokas
@ArnieZ: I noticed a comment on Mike's answer. I just wanted to clarify: I meant that the pipe-delimited ids would be separated into multiple fields, all having the same field name.
Adam Paynter
@Adam Paynter: Ahh, I see. I will try this now.
Arnold Zokas
@ArnieZ: :) The BooleanQuery is constructed using one BooleanClause per tag IN THE DOCUMENT THAT THE USER IS VIEWING.
Adam Paynter
@Adam Paynter: Query parser came up with the following query: "Tags:a845497737704e8ab439dd410e7f1328 Tags:0a2d7192f75148cca89b6df58fcf2e54 Tags:204fce58c936434598f7bd7eccf11771"
Arnold Zokas
@Adam Paynter: The only result I get is the current document itself. Odd...
Arnold Zokas
@ArnieZ: Try building a query such as "Tags:a845497737704e8ab439dd410e7f1328 OR Tags:0a2d7192f75148cca89b6df58fcf2e54 OR Tags:204fce58c936434598f7bd7eccf11771"
Adam Paynter
@Adam Paynter: It looks like, in Lucene.NET, my query syntax is equivalent to your query syntax. I will experiment with fewer tags...
Arnold Zokas
@Adam Paynter: Sadly, this approach does not work for me. I am going to try the Link Database approach. Thanks for the help.
Arnold Zokas
@ArnieZ: Did you separate the ids into multiple fields? Were the fields specified as being indexed?
Adam Paynter
@Adam Paynter: Yes and yes.
Arnold Zokas
+2  A: 

You can have the same field multiple times in a document. In this case, you would add multiple "tag" fields at index time by splitting on |. Then, when you search, you just have to search on the "tag" field.
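As a rough illustration of the split step, here is a minimal sketch. It is written in Python only to stay library-neutral, and the `split_tags` helper is hypothetical rather than part of Lucene. It turns the stored pipe-delimited value into one entry per tag, so that each entry can then be added to the document as its own "tag" field:

```python
def split_tags(raw):
    """Split a pipe-delimited tag string into a list of tag ids.

    Empty segments are dropped, so an empty field yields an empty list
    (a document may carry 0 to n tags).
    """
    return [part.strip() for part in raw.split("|") if part.strip()]


raw_field = (
    "a845497737704e8ab439dd410e7f1328|"
    "0a2d7192f75148cca89b6df58fcf2e54|"
    "204fce58c936434598f7bd7eccf11771"
)

# One (field name, value) pair per tag. In Lucene, each pair would become
# a separate field instance with the same name on the same document.
tag_fields = [("tag", tag_id) for tag_id in split_tags(raw_field)]
```

Dropping empty segments means a trailing pipe or an empty field does not produce a blank tag.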

Mike
I didn't know this was possible. Do you know of any SDK page that describes this?
Arnold Zokas
I usually use the Java version's documentation, which says "Several fields may be added with the same name": http://lucene.apache.org/java/2_4_1/api/org/apache/lucene/document/Document.html#add(org.apache.lucene.document.Fieldable)
Mike
@Mike: Thanks, I'll look into this.
Arnold Zokas
+1  A: 

Setting aside for a minute the possible uses of Lucene for this task (which I am not overly familiar with) - consider checking out the LinkDatabase.

Sitecore will, behind the scenes, track all your references to and from items. And since your multiple tags are indeed (I assume) selected from a meta hierarchy of tags represented as Sitecore Items somewhere - the LinkDatabase would be able to tell you all items referencing it.

In some sort of pseudo code mockup, this would then become

for each ID in tags
  get all documents referencing this tag
  for each document found
    if master-list contains document; increase usage-count
    else; add document to master list with usage-count 1
sort master-list by usage-count descending

Forgive me for not being more precise, but I am unable to mock up a fully working example right at this stage.
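The counting logic in the pseudocode above can be sketched in a few lines. This is only an illustration in Python, not Sitecore code: `referrers_by_tag` is a hypothetical stand-in for whatever the LinkDatabase lookup returns, modelled here as a plain dict from tag id to the ids of the documents referencing that tag. The `current_doc` parameter is also an assumption, added so the document being viewed is left out of its own list.

```python
from collections import Counter


def related_documents(tag_ids, referrers_by_tag, current_doc=None):
    """Return document ids ordered by how many of the given tags they share.

    referrers_by_tag is a hypothetical stand-in for the link lookup:
    a dict mapping a tag id to the ids of documents referencing that tag.
    """
    usage_count = Counter()
    for tag_id in tag_ids:
        # set() guards against a document being listed twice for one tag
        for doc_id in set(referrers_by_tag.get(tag_id, [])):
            if doc_id != current_doc:
                usage_count[doc_id] += 1
    # Highest count first; ties are broken by document id for a stable order
    return sorted(usage_count, key=lambda doc_id: (-usage_count[doc_id], doc_id))


# Made-up sample data for illustration only
referrers = {
    "tag-a": ["doc1", "doc2", "doc3"],
    "tag-b": ["doc1", "doc3"],
    "tag-c": ["doc1", "doc3", "doc4"],
}
print(related_documents(["tag-a", "tag-b", "tag-c"], referrers, current_doc="doc1"))
# prints ['doc3', 'doc2', 'doc4']
```

A `Counter` covers both branches of the pseudocode at once: a document seen for the first time starts at a count of 1, and later sightings increase it.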

You can find an article about the LinkDatabase here http://larsnielsen.blogspirit.com/tag/XSLT. Be aware that if you're tagging documents using a TreeListEx field, there is a known flaw in earlier versions of Sitecore. Documented here: http://www.cassidy.dk/blog/sitecore/2008/12/treelistex-not-registering-links-in.html

Mark Cassidy
@Mark Cassidy: I have not used the Link Database before, but I am going to try your approach.
Arnold Zokas
I wrote up a full article with an implementation of this pseudo-code - if for no other reason than just to assert to myself it could be done the way I envisioned ;-) You can find it here: http://www.cassidy.dk/blog/sitecore/2009/05/listing-related-articles-with-sitecore.html
Mark Cassidy
+1  A: 

Try this query on the tag field.

+(tag1 OR tag2 OR ... tagN)

where tag1, .. tagN are the tags of a document.

This query returns documents that match at least one tag. Scoring takes care of the ordering automatically: the final score is the sum of the scores of the individual matching clauses, so documents with the most matching tags rise to the top.

Also, you need to realize that if you search for documents similar to the tags of Doc1, Doc1 itself will come out at the top of the search results. So, handle this case accordingly.
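One possible way to assemble such a query string, including a prohibited clause that leaves out the document being viewed, is sketched below in Python. The `build_related_query` helper and the `docid` field name are hypothetical. The `Tags` field name is borrowed from the query quoted in the comments on another answer. The sketch also assumes the tag ids are plain hex strings with no characters that would need escaping in the query syntax.

```python
def build_related_query(tag_ids, field="Tags", id_field=None, current_id=None):
    """Build a Lucene-style query string matching any of the given tags.

    Returns None when there are no tags, since an empty group would not
    be a usable query. id_field / current_id are hypothetical: when both
    are given, a prohibited clause excludes the document being viewed.
    """
    if not tag_ids:
        return None
    clauses = " OR ".join(f"{field}:{tag_id}" for tag_id in tag_ids)
    query = f"+({clauses})"
    if id_field and current_id:
        query += f" -{id_field}:{current_id}"
    return query


tags = [
    "a845497737704e8ab439dd410e7f1328",
    "0a2d7192f75148cca89b6df58fcf2e54",
]
# "docid" and "doc1" are placeholder values for illustration
print(build_related_query(tags, id_field="docid", current_id="doc1"))
```

Excluding the current document inside the query is one option. Filtering it out of the result list afterwards would work equally well.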

Shashikant Kore