ansaurus

Question

Working with a large data object between ruby processes

Answer 1

A:

Be careful with memcached: it limits the size of a single object (1 MB by default).

One thing to try is to use MongoDB as your storage. It is pretty fast and you can map pretty much any data structure into it.

Zepplock 2010-05-26 03:29:47

Yes, I did hit the memcached limit; it was 1 MB. I got around it by gzip-compressing the output of Marshal.dump so that it fit under the limit, but that hurt performance even further.
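A minimal sketch of the workaround described above, assuming the object is a plain Ruby hash. The helper names `compress_object` and `decompress_object` and the sample data are illustrative, not taken from the original code; only `Marshal` and `Zlib` from the standard library are used.

```ruby
require 'zlib'

# Hypothetical helpers: serialize with Marshal, then compress with zlib.
def compress_object(object)
  Zlib::Deflate.deflate(Marshal.dump(object))
end

def decompress_object(blob)
  Marshal.load(Zlib::Inflate.inflate(blob))
end

# Made-up sample data: 1,000 users with 500 boolean interests each.
data = {}
1_000.times { |id| data[id] = Array.new(500) { |i| (id + i) % 10 == 0 } }

raw  = Marshal.dump(data)
blob = compress_object(data)

puts "marshaled:  #{raw.bytesize} bytes"
puts "compressed: #{blob.bytesize} bytes"
puts "round trip ok: #{decompress_object(blob) == data}"
```

Both steps add CPU time on every read and write, which is the performance cost mentioned above.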

Gdeglin 2010-05-26 03:41:13

Answer 2

A:

If it's sensible to wrap your monster hash in a method call, you might simply present it using DRb - start a small daemon that starts a DRb server with the hash as the front object - other processes can make queries of it using what amounts to RPC.

More to the point, is there another approach to your problem? Without knowing what you're trying to do, it's hard to say for sure - but maybe a trie, or a Bloom filter would work? Or even a nicely interfaced bitfield would probably save you a fair amount of space.
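As a rough sketch of the bitfield idea, assuming each user's interests are a fixed-length list of booleans: the list can be packed into a single Ruby Integer, and the overlap between two users becomes a bitwise AND followed by a bit count. The helper names and sample data here are hypothetical.

```ruby
# Pack an array of booleans into one Integer; bit i is set when flags[i] is true.
def to_bitfield(flags)
  flags.each_with_index.inject(0) do |bits, (flag, i)|
    flag ? bits | (1 << i) : bits
  end
end

# Number of interests two users share: AND the bitfields, then count the 1 bits.
def shared_count(a, b)
  (a & b).to_s(2).count('1')
end

alice = to_bitfield([true, false, true, true, false])
bob   = to_bitfield([true, true, false, true, false])

puts alice.to_s(2)             # => 1101
puts shared_count(alice, bob)  # => 2
```

Stored this way, 500 interests fit in a single 500-bit Integer per user instead of an Array of 500 object references.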

Judson 2010-05-27 06:02:42

I have a related question that explains what kind of data is stored and why here: http://stackoverflow.com/questions/2878429/algorithm-for-finding-similar-users-through-a-join-table

DRb sounds interesting. I'm hesitant to jump into it since it seems to be really old (last update June 04) and a lot of Google search results for it are bug discussions. One possibility might be a Sinatra app that exposes an API to act on the hash (at least that would likely be simpler and better understood).

Gdeglin 2010-06-01 23:10:06

The "last update" issue, I think, comes from the fact that DRb is part of the Ruby stdlib. The best thing to look at is probably just the documentation for it at ruby-doc.org/stdlib.

Judson 2010-06-08 18:39:40

Answer 3

A:

Have you considered upping the memcache max object size?

On versions 1.4.2 and later:

memcached -I 11m  # giving yourself an extra MB of space

Or, on earlier versions, change the value of POWER_BLOCK in slabs.c and recompile.

Mike Buckbee 2010-06-03 13:57:26

Yes. But that's not the problem. Simply putting the object into memcached and pulling it out of memcached takes too long since the object must go through a serialization and deserialization process.
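To see how much of the time goes to serialization alone, one option is to time `Marshal.dump` and `Marshal.load` on a structure of similar shape, with memcached out of the picture. This is only a sketch: the sizes and random data below are made up, and timings will vary by machine and Ruby version.

```ruby
require 'benchmark'

# Made-up stand-in for the real object: 10,000 users x 500 boolean interests.
data = {}
10_000.times { |id| data[id] = Array.new(500) { rand(10) == 0 } }

blob = nil
dump_time = Benchmark.realtime { blob = Marshal.dump(data) }

copy = nil
load_time = Benchmark.realtime { copy = Marshal.load(blob) }

puts "dump: #{(dump_time * 1000).round} ms, #{blob.bytesize} bytes"
puts "load: #{(load_time * 1000).round} ms"
```

Any cache that lives in another process pays at least this cost on every full read, before network transfer is counted.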

Gdeglin 2010-06-03 21:46:32

Answer 4

+1 A:

A Sinatra app will work, but the (un)serializing and the HTTP parsing could hurt performance compared to a DRb service.

Here's an example, based on your example in the related question. I'm using a hash instead of an array so you can use user ids as indexes. This way there is no need to keep both a table of interests and a table of user ids on the server. Note that the interest table is "transposed" compared to your example, which is the way you want it anyway, so a user can be updated in one call.

# server.rb
require 'drb'

class InterestServer < Hash
  include DRbUndumped # don't send the data over!

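  # Returns [match_count, user_id] pairs sorted ascending, so the best matches come last.
  def closest(cur_user_id)
    cur_interests = fetch(cur_user_id)
    # Indexes of the interests the current user has
    selected_interests = cur_interests.each_index.select { |i| cur_interests[i] }

    scores = map do |user_id, interests|
      nb_match = selected_interests.count { |i| interests[i] }
      [nb_match, user_id]
    end
    scores.sort!
  end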
end

DRb.start_service nil, InterestServer.new
puts DRb.uri

DRb.thread.join


# client.rb

uri = ARGV.shift
require 'drb'
DRb.start_service
interest_server = DRbObject.new nil, uri


USERS_COUNT = 10_000
INTERESTS_COUNT = 500

# Mock users
users = Array.new(USERS_COUNT) { {:id => rand(100000)+100000} }

# Initial send over user interests
users.each do |user|
  interest_server[user[:id]] = Array.new(INTERESTS_COUNT) { rand(10) == 0 }
end

# query at will
puts interest_server.closest(users.first[:id]).inspect

# update, say there's a new user:
new_user = {:id => 42}
users << new_user
# This guy is interested in everything!
interest_server[new_user[:id]] = Array.new(INTERESTS_COUNT) { true }

puts interest_server.closest(users.first[:id])[-2,2].inspect
# Will output our first user and this new user which both match perfectly

To run it in a terminal, start the server and pass the URI it prints as the argument to the client:

$ ruby server.rb
druby://mal.lan:51630

$ ruby client.rb druby://mal.lan:51630
[[0, 100035], ...]

[[45, 42], [45, 178902]]

Marc-André Lafortune 2010-06-04 22:02:11

Thanks. I'm actually pretty far along with a Sinatra app that connects to the same database to initially load the data, then has a REST API to keep things up to date and to fetch common users when sent a user id. DRb seems pretty elegant from your example, so I think I'll give that a shot too.

Gdeglin 2010-06-05 01:59:31

This worked great. Thanks again.

Gdeglin 2010-06-08 02:27:00

Answer 5

+2 A:

Maybe it's too obvious, but if you are willing to sacrifice a little access speed to the members of your hash, a traditional database will give you much more consistent access times. You could start there and then add caching to see whether you can get enough speed from it. This would be a little simpler than using Sinatra or some other tool.

ndp 2010-06-05 15:46:47

Thanks for the response. The data actually is stored in a traditional database. I'm copying it all into a hash because it's just too slow to run the queries I need with SQL. You can get a better idea of what I'm using this for by looking at this question: http://stackoverflow.com/questions/2878429/algorithm-for-finding-similar-users-through-a-join-table

Gdeglin 2010-06-06 18:20:02

Answer 6

A:

What about storing the individual values in memcached instead of storing the whole Hash in memcached? Using your code above:

# Cache is assumed to be an already-configured memcached client
@a = []
0.upto(500) do |r|
  @a[r] = []
  0.upto(10_000) do |c|
    key = "#{r}:#{c}"
    if rand(10) == 0
      Cache.set(key, 1) # 10% chance of being 1
    else
      Cache.set(key, 0)
    end
  end
end

This will be speedy, you won't have to worry about serialization, and all of your systems will have access to it. I asked in a comment on the main post about how you access the data. You will have to get creative, but it should be easy to do.
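One way to picture the access pattern, sketched with a plain Hash standing in for the memcached client. The `FakeCache` class and the small sizes are assumptions for illustration, and a real client's API may differ. With one key per cell, reading a single column means one lookup per row.

```ruby
# Hypothetical in-memory stand-in for a memcached client.
class FakeCache
  attr_reader :gets

  def initialize
    @store = {}
    @gets = 0
  end

  def set(key, value)
    @store[key] = value
  end

  def get(key)
    @gets += 1
    @store[key]
  end
end

rows = 5
cols = 3
cache = FakeCache.new

# Same "row:column" key scheme as above, filled with a deterministic pattern.
rows.times do |r|
  cols.times do |c|
    cache.set("#{r}:#{c}", (r + c).even? ? 1 : 0)
  end
end

# Reading column 1 takes one get per row.
column = (0...rows).map { |r| cache.get("#{r}:1") }
puts column.inspect  # => [0, 1, 0, 1, 0]
puts cache.gets      # => 5
```

At the sizes in the answer, a full scan would need roughly five million lookups, so the cost of each one matters.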

Sixty4Bit 2010-06-05 20:11:12

Thanks for the response. Unfortunately I need to be able to iterate through the data very quickly (basically read through the entire object in under 150 milliseconds). This is possible if I store it in a hash or an array, but definitely not possible if each value is stored separately in memcached.

Gdeglin 2010-06-06 18:17:06
