views:

273

answers:

1

I'm developing a sort of personalized search engine in Ruby on Rails, and I'm currently trying to find the best way of sorting results depending on the user's record, in real time.

Example: items that are searched for can have tags (separate entities with ids); for example, an item has tags=[1, 5, 10, 23, 45].

The user, on the other hand, may have flagged some tags as being of special interest, so say the user has tags=[5, 23].

The score used for sorting the results should take into account the number of the item's tags that are "starred" by the user. For example, an item's score would be 50% based on the item's attributes and 50% on a rank derived from the user's record (the number of its tags the user has starred).
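
A minimal sketch of that blended score, assuming the 50/50 weights above and a pre-computed attribute-based `base_score` (all names here are illustrative, not an existing API):

```ruby
# Blend a precomputed attribute score with the fraction of the user's
# starred tags that the item carries. Weights are the 50/50 split
# described above; adjust to taste.
def personalized_score(item_tags, user_tags, base_score)
  overlap = (item_tags & user_tags).size
  tag_rank = user_tags.empty? ? 0.0 : overlap.to_f / user_tags.size
  0.5 * base_score + 0.5 * tag_rank
end
```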

One idea was to inject this into the sorting function of the information retrieval system. But in Sphinx, which I will probably use, it would be very awkward to implement (when the user's vector is large). I don't know about Lucene/Solr, but they don't seem to have the advanced non-text search capabilities that I need anyway (distance, date, time, etc.).

The other option is to retrieve an intermediate set from the IR system and then process it at the application level. However, I'm quite sure that processing 100-1000 records sequentially and then sorting them in Rails would be very slow.

On the other hand, it seems like a task that can easily be processed in parallel: divide the 1000 records into sets that are processed by separate threads, then merge and sort the results.
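
For reference, the chunk-and-merge approach can be sketched in plain Ruby threads; note that on MRI the GIL limits true CPU parallelism, so this is more of an illustration of the structure than a proven speedup (record shape and names are assumptions):

```ruby
# Split candidate records into chunks, score each chunk in its own
# thread, then merge and sort by score descending. Each record is
# assumed to be a hash with a :tags array; the score here is simply
# the tag overlap with the user's starred tags.
def parallel_sort(records, user_tags, threads: 4)
  chunks = records.each_slice((records.size / threads.to_f).ceil).to_a
  scored = chunks.map { |chunk|
    Thread.new {
      chunk.map { |r| [(r[:tags] & user_tags).size, r] }
    }
  }.flat_map(&:value)  # Thread#value joins and returns the result
  scored.sort_by { |score, _| -score }.map(&:last)
end
```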

I read about several map-reduce implementations, both universal ones like Hadoop and Rails-specific ones like Skynet, but they seem to be best suited for large batch jobs, not real-time processing (unless I'm mistaken?).

Is there any lightweight, in-memory MR implementation that I could use for this? Or maybe you have some other ideas on how to handle it?

(Sidenote: I believe this setup is similar to how Google News works, from what I understand from the "Google News personalization: scalable online collaborative filtering" paper. They match, in real time, a set of candidate stories against the set of clusters to which the user belongs (pre-calculated earlier) in order to sort the stories in a personalized way.)

+1  A: 

Map/Reduce is great for this kind of thing, but you can also deal with it in SQL using an intermediate table.

Presumably, you already have tables like this:

users (id, ...)
items (id, ...)
tags (id, ...)
users_tags (user_id, tag_id)
items_tags (item_id, tag_id)

So, why don't you also maintain a table like this:

users_items_tags (user_id, item_id, tag_id)

where each row means "this user and this item share this tag".

Then, your search query is something like this:

  select item_id, count(tag_id) as score
    from users_items_tags
   where user_id = <USER_ID>
group by item_id
order by score desc
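
To make the query's semantics concrete, here is the same computation sketched over plain in-memory rows (hashes standing in for users_items_tags rows; the function name is illustrative):

```ruby
# For a given user, count shared tags per item and sort by that count
# descending -- the in-memory equivalent of the GROUP BY query above.
def scores_for(user_id, users_items_tags)
  users_items_tags.
    select { |row| row[:user_id] == user_id }.
    group_by { |row| row[:item_id] }.
    map { |item_id, rows| [item_id, rows.size] }.
    sort_by { |_, score| -score }
end
```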

When a user adds a tag, users_items_tags is updated like so:

insert into users_items_tags (user_id, item_id, tag_id)
     select <USER_ID>, item_id, <TAG_ID>
       from items_tags
      where tag_id = <TAG_ID>

and likewise when a tag is added to an item. When a tag is removed, just delete the rows matching that tag and user/item.
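
The fan-out that the insert performs can be sketched in Ruby over in-memory tables (arrays of hashes here; in Rails this would live in an after-save callback or observer, names illustrative):

```ruby
# When a user stars a tag, append one users_items_tags row per item
# that carries that tag -- mirroring the INSERT ... SELECT above.
def user_adds_tag(user_id, tag_id, items_tags, users_items_tags)
  items_tags.each do |row|
    next unless row[:tag_id] == tag_id
    users_items_tags << { user_id: user_id, item_id: row[:item_id], tag_id: tag_id }
  end
  users_items_tags
end
```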

This solution has a few problem cases. If a particular tag is common amongst items then a lot of writes will be performed when a user adds that tag, and vice versa. If a tag is common amongst both items and users then the table will grow very large. You'll have to consider these cases for your particular dataset.

jedediah
Thanks, that's one possibility; however, since a user may well aim to have 20-50% of items recommended to him, and the number of tags is intentionally limited, this would result in quite a massive amount of data.
Otigo