The title is a bit awkward, but I couldn't find a better one. My problem is as follows:

I have several users stored as documents, and for each document I store several key-value pairs, or items, each of which has an id. If I apply highlighting with hl.snippets=5, I can get the first 5 items. But every user could have several hundred items, so

  • you will not get the most relevant 5 items. You will get the first 5 items ...

Another problem is that

  • the highlighted text won't contain the id and so retrieving additional information of the highlighted item text is ugly.

Example where items are emails:

user1 has item1 { text:"developers developers developers", id:1, title:"ms" }
          item2 { text:"c# development",                   id:2, title:"nice!" }
          ...
          item77 ...

user2 has item1 { text:"nice restaurant", id:3, title:"bla"}
          item2 { text:"best cafe",       id:4, title:"blup"}
          ...
          item223 ...

Now if I use highlighting for the text field and query against "restaurant" I get user2 and the text nice <b>restaurant</b>. But how can I determine the id of the highlighted text to display e.g. the title of this item? And what happens if more relevant items are listed at the end of the item-list? Highlighting won't display those ...
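For concreteness, the highlighting request described above could be built roughly like this. It is only a sketch: `q`, `hl`, `hl.fl` and `hl.snippets` are standard Solr request parameters, but the helper name `highlight_params` and the field name `text` are assumptions taken from the example.

```python
from urllib.parse import urlencode


def highlight_params(query, field="text", snippets=5):
    """Build request parameters for a highlighted search (hypothetical helper)."""
    return {
        "q": f"{field}:{query}",   # search the multi-valued item text field
        "hl": "true",              # switch highlighting on
        "hl.fl": field,            # field to take snippets from
        "hl.snippets": snippets,   # upper bound on snippets per document
    }


params = highlight_params("restaurant")
query_string = urlencode(params)  # ready to append to a select URL
```

Note that nothing in these parameters carries the item id, which is exactly the second problem listed above.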

So how can I find the best items of a document that contains many such items?

I added my two findings as answers but, as I will point out, each of them has its own drawbacks.

Could anyone point me to a better solution?

A: 

You can use the collapse patch and store each item as separate document linking back to the user.

The problem with that approach is that you won't get the most relevant user. I.e., the most relevant item is not necessarily from the most relevant user (because that user may instead have several slightly less relevant items).

See the "Assume the following example:" part in my second answer.

Karussell
A: 

You could use two indices: users->items as described in the question, and an index with 'pure items' referencing back to the user.

Then you will need 2 queries (that's the reason I called the question '2d Search in Solr'):

  1. query the user index => list of e.g. 10 users
  2. query the items index for each user of the 1. step => best items
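The two steps above can be sketched as a request plan. This is a minimal sketch under assumptions, not a tested Solr client: the index names `users` and `items`, the back-reference field `user` in the item index, and both helper names are hypothetical. `q`, `fq` and `rows` are standard Solr parameters.

```python
def user_query(query, max_users=10):
    """Step 1: one request against the (hypothetical) users index."""
    return ("users", {"q": query, "rows": max_users})


def item_queries(query, user_ids, items_per_user=3):
    """Step 2: one request per user against the (hypothetical) items index."""
    return [
        ("items", {"q": query, "fq": f"user:{user_id}", "rows": items_per_user})
        for user_id in user_ids
    ]


# Pretend step 1 returned these ten user ids.
found_users = [f"user{n}" for n in range(1, 11)]

plan = [user_query("restaurant X")] + item_queries("restaurant X", found_users)
total_requests = len(plan)  # 1 user query + 1 item query per user
```

With 10 users from the first step, the plan holds 1 + 10 requests, which is where the performance drawback comes from.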

Assume the following example:

userA emails are "restaurant X is bad but restaurant X is cheap", "different topic", "different topicB" and

userB emails are "restaurant X is not nice", "revisited restaurant X and it was ok now", "again in restaurant X and I think it is the best".

Now I query the user index for "restaurant X" and the first user will be userB, which is what I want. If I queried only the item index, I would get item1 of the less relevant userA first.
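The difference between the two rankings in the example above can be reproduced with a toy score. This is only an illustration: it counts raw occurrences of the phrase, which is a stand-in and not how Solr actually scores documents, and the helper names are made up.

```python
emails = {
    "userA": [
        "restaurant X is bad but restaurant X is cheap",
        "different topic",
        "different topicB",
    ],
    "userB": [
        "restaurant X is not nice",
        "revisited restaurant X and it was ok now",
        "again in restaurant X and I think it is the best",
    ],
}


def phrase_count(text, phrase):
    """Toy relevance: number of times the phrase occurs in the text."""
    return text.count(phrase)


def best_user(emails, phrase):
    """Rank at user level: sum the toy score over all of a user's emails."""
    return max(emails, key=lambda u: sum(phrase_count(t, phrase) for t in emails[u]))


def best_item(emails, phrase):
    """Rank at item level: the single email with the highest toy score."""
    scored = [
        (phrase_count(text, phrase), user, number)
        for user, texts in emails.items()
        for number, text in enumerate(texts, start=1)
    ]
    _, user, number = max(scored, key=lambda entry: entry[0])
    return user, number


top_user = best_user(emails, "restaurant X")
top_item = best_item(emails, "restaurant X")
```

Under this toy score the user-level winner is userB (three matching emails), while the item-level winner is item1 of userA (one email with two matches).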

Drawbacks:

  • bad performance, because you will need one query against the user index and e.g. 10 more to get the most relevant items for each user.
  • maintaining two indices.

Update: to avoid many queries, I will try the following: use the user index to get some highlighted snippets, and then offer a 'get relevant items' button for every user, which triggers a query against the item index.

Karussell
Why do you have to search for 'users' first? Also I don't understand why *11* queries.
Mauricio Scheffer
+1  A: 

One of my rules of thumb for designing Solr schemas is: the document is what you will search for.

If you want to search for 'items', then these 'items' are your documents. How you store other stuff, like 'users', is secondary. So 'users' could be in another index like you mentioned, they could be "denormalized" (e.g. their information duplicated in each document), in a relational database, etc. depending on RDBMS availability, how many 'users' there are, how many fields these 'users' have, etc.

EDIT: now you explain that the 'items' are emails, and a possible search is 'restaurant X' and you want to find the best 'items' (emails). Therefore, the document is the email. The schema could be as simple as this: (id, title, text, user).

You could enable highlighting to get snippets of the 'text' or 'title' fields matching the 'restaurant X' query.
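With one document per email, each snippet can be tied back to its id and title. As far as I know, Solr keys the highlighting section of the response by the document's unique key, so the join is a dictionary lookup. The sketch below uses a hand-written sample in that shape (not real Solr output), and `attach_snippets` is a hypothetical helper.

```python
# Hand-written sample in the shape of a Solr JSON response (illustrative only).
sample_response = {
    "response": {"docs": [{"id": "3", "title": "bla", "user": "user2"}]},
    "highlighting": {"3": {"text": ["nice <b>restaurant</b>"]}},
}


def attach_snippets(solr_response, field="text"):
    """Merge each returned document with its highlighted snippets, matched by id."""
    highlighting = solr_response.get("highlighting", {})
    results = []
    for doc in solr_response["response"]["docs"]:
        snippets = highlighting.get(doc["id"], {}).get(field, [])
        results.append({**doc, "snippets": snippets})
    return results


results = attach_snippets(sample_response)
```

Each entry in `results` then carries the id, title, user and snippets together, so showing the title next to the highlighted text needs no extra lookup.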

If you want to give the end-user information about the users that wrote about 'restaurant X', you could facet on the 'user' field. Then the end-user would see that John wrote 10 emails about 'restaurant X' and Robert wrote 6. The end-user thinks "This John dude must know a lot about this restaurant", so he drills down into a search for 'restaurant X' with a filter query user:John.
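The facet and drill-down requests described above might look like this. It is a sketch only: `facet`, `facet.field` and `fq` are standard Solr parameters, while the helper names and the `user` field come from the schema suggested in this answer, not from a real setup.

```python
def facet_by_user_params(query, rows=10):
    """First request: search the emails and count matches per user."""
    return {
        "q": query,
        "rows": rows,
        "facet": "true",
        "facet.field": "user",  # one count per distinct user value
    }


def drill_down_params(query, user):
    """Second request: same search, restricted to one user's emails."""
    return {"q": query, "fq": f"user:{user}"}


overview = facet_by_user_params("restaurant X")
johns_emails = drill_down_params("restaurant X", "John")
```

Using `fq` for the user restriction keeps the relevance ranking driven by the main query only.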

Mauricio Scheffer
Thanks Mauricio, for the suggestion! The problem is that I want users as 'documents'. But I also need to show the best items of each user, because there could be a lot of items, so that my customers can easily decide, when they search, whether the suggested users match their query intention.
Karussell
The problem then is when a query returns items from only a few users, or even just one. The collapse patch described in my other answer could improve this situation, so that I get at most 1 or 2 items per user.
Karussell
@Karussell: yes I think I'm slowly getting to understand your problem now :-) . I guess you'll have to decide what's more important for the end-user: getting relevant 'items' (whatever the 'user' is) or getting not so relevant 'items' from a variety of 'users'.
Mauricio Scheffer
@Mauricio thank you for taking the time for my problem! And sorry if I didn't precisely describe what I want :-/ At the current stage I will use highlighting to display two snippets for every user and then query the second index (via javascript) only if the customer needs more items of that user.
Karussell