I have a Ruby hash that reaches approximately 10 megabytes if written to a file using Marshal.dump. After gzip compression it is approximately 500 kilobytes.
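For reference, the sizes above come from a round trip along these lines (a minimal sketch using the stdlib Zlib; @hash stands in for my real data, and Zlib's deflate is close enough to gzip for the size comparison):

require 'zlib'

blob = Marshal.dump(@hash)                                  # ~10 MB serialized
compressed = Zlib::Deflate.deflate(blob)                    # ~500 KB compressed
restored = Marshal.load(Zlib::Inflate.inflate(compressed))  # round trip back to a hash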

Iterating through and altering this hash is very fast in Ruby (fractions of a millisecond). Even copying it is extremely fast.

The problem is that I need to share the data in this hash between Ruby on Rails processes. To do this using the Rails cache (file_store or memcached) I need to Marshal.dump the hash first, which incurs a 1000 millisecond delay when serializing it and a 400 millisecond delay when deserializing it.

Ideally I would want to be able to save and load this hash from each process in under 100 milliseconds.

One idea is to spawn a new Ruby process that holds this hash and provides an API for the other processes to modify or query the data within it, but I want to avoid doing this unless I'm certain there is no other way to share this object quickly.

Is there a way I can more directly share this hash between processes without needing to serialize or deserialize it?

Here is the code I'm using to generate a structure similar to the hash I'm working with:

@a = []
0.upto(500) do |r|
  @a[r] = []
  0.upto(10_000) do |c|
    if rand(10) == 0 
      @a[r][c] = 1 # 10% chance of being 1
    else
      @a[r][c] = 0
    end
  end
end

@c = Marshal.dump(@a) # 1000 milliseconds
Marshal.load(@c) # 400 milliseconds
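
If you want to reproduce the timings, the stdlib Benchmark works; a sketch:

require 'benchmark'

puts Benchmark.measure { @c = Marshal.dump(@a) }  # ~1 second
puts Benchmark.measure { Marshal.load(@c) }       # ~0.4 seconds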

Update:

Since my original question did not receive many responses, I'm assuming there's no solution as easy as I would have hoped.

Presently I'm considering two options:

  1. Create a Sinatra application to store this hash with an API to modify/access it.
  2. Create a C application to do the same as #1, but a lot faster.

The scope of my problem has increased such that the hash may be larger than in my original example, so #2 may be necessary. However, I have no idea where to start writing a C application that exposes an appropriate API.

A good walkthrough of how best to implement #1 or #2 may receive the best-answer credit.

Update 2

I ended up implementing this as a separate application written in Ruby 1.9 that exposes a DRb interface to the application instances. I use the Daemons gem to spawn the DRb instance when the web server starts up. On startup, the DRb application loads the necessary data from the database, and then communicates with the clients to return results and stay up to date. It's running quite well in production now. Thanks for the help!
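
For anyone curious, the daemon wiring looks roughly like this (a simplified sketch, not my production code; the file name, app name, and URI are made up):

# interest_daemon.rb
require 'daemons'
require 'drb'

Daemons.run_proc('interest_daemon') do
  data = {} # in the real application this is loaded from the database on startup
  DRb.start_service('druby://localhost:9999', data)
  DRb.thread.join
end

Start it with "ruby interest_daemon.rb start" when the web server boots, and stop it with "ruby interest_daemon.rb stop".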

A: 

Be careful with memcache; it has object size limitations (2 MB or so).

One thing to try is to use MongoDB as your storage. It is pretty fast and you can map pretty much any data structure into it.

Zepplock
Yes. I did hit the memcached limit. It was 1 MB. I was able to get around it by gzip-compressing the object after Marshal.dump, which fit under the limit but hurt performance even further.
Gdeglin
A: 

If it's sensible to wrap your monster hash in a method call, you might simply present it using DRb: start a small daemon that runs a DRb server with the hash as the front object; other processes can make queries of it using what amounts to RPC.

More to the point, is there another approach to your problem? Without knowing what you're trying to do, it's hard to say for sure, but maybe a trie or a Bloom filter would work? Or even a nicely interfaced bitfield would probably save you a fair amount of space.
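
For instance, a minimal bitfield sketch (one bit per 0/1 flag; the class and names are just illustrative, and it needs Ruby 1.9 for getbyte/setbyte):

class BitField
  def initialize(size)
    @field = "\0" * ((size + 7) / 8) # one byte stores 8 flags
  end

  def set(i, value)
    byte, bit = i.divmod(8)
    if value
      @field.setbyte(byte, @field.getbyte(byte) | (1 << bit))
    else
      @field.setbyte(byte, @field.getbyte(byte) & ~(1 << bit))
    end
  end

  def get(i)
    byte, bit = i.divmod(8)
    @field.getbyte(byte)[bit] == 1
  end
end

bits = BitField.new(500 * 10_000) # ~625 KB in memory versus ~10 MB marshaled
bits.set(42, true)
bits.get(42) # => true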

Judson
I have a related question that explains what kind of data is stored and why, here: http://stackoverflow.com/questions/2878429/algorithm-for-finding-similar-users-through-a-join-table DRb sounds interesting. I'm hesitant to jump into it since it seems to be really old (last update June 04) and a lot of Google search results for it are bug discussions. One possibility might be a Sinatra app that exposes an API to act on the hash (at least that would likely be simpler and better understood).
Gdeglin
The "last update" issue I think comes from the fact that it's part of the Ruby stdlib. The best thing to look at I think is just the rubydoc for it at ruby.org/stdlib.
Judson
A: 

Have you considered upping the memcached max object size?

On versions greater than 1.4.2:

memcached -I 11m # raise the per-item limit to 11 MB (the default is 1 MB)

or, on previous versions, change the value of POWER_BLOCK in slabs.c and recompile.

Mike Buckbee
Yes, but that's not the problem. Simply putting the object into memcached and pulling it back out takes too long, since the object must go through serialization and deserialization.
Gdeglin
+1  A: 

A Sinatra app will work, but the {un}serializing and the HTTP parsing could impact performance compared to a DRb service.

Here's an example, based on your example in the related question. I'm using a hash instead of an array so you can use user ids as indexes. This way there is no need to keep both a table of interests and a table of user ids on the server. Note that the interest table is "transposed" compared to your example, which is the way you want it anyway, so it can be updated in one call.

# server.rb
require 'drb'

class InterestServer < Hash
  include DRbUndumped # don't send the data over; clients get a proxy instead

  def closest(cur_user_id)
    cur_interests = fetch(cur_user_id)
    # indexes of the interests the current user has
    selected_interests = cur_interests.each_index.select { |i| cur_interests[i] }

    # score every user by how many of those interests they share
    scores = map do |user_id, interests|
      nb_match = selected_interests.count { |i| interests[i] }
      [nb_match, user_id]
    end
    scores.sort! # best matches last
  end
end

DRb.start_service nil, InterestServer.new
puts DRb.uri # clients connect using this URI

DRb.thread.join


# client.rb

uri = ARGV.shift
require 'drb'
DRb.start_service
interest_server = DRbObject.new nil, uri


USERS_COUNT = 10_000
INTERESTS_COUNT = 500

# Mock users
users = Array.new(USERS_COUNT) { {:id => rand(100000)+100000} }

# Initial send over user interests
users.each do |user|
  interest_server[user[:id]] = Array.new(INTERESTS_COUNT) { rand(10) == 0 }
end

# query at will
puts interest_server.closest(users.first[:id]).inspect

# update, say there's a new user:
new_user = {:id => 42}
users << new_user
# This guy is interested in everything!
interest_server[new_user[:id]] = Array.new(INTERESTS_COUNT) { true } 

puts interest_server.closest(users.first[:id])[-2,2].inspect
# Will output our first user and this new user which both match perfectly

To run in a terminal, start the server and pass its output as the argument to the client:

$ ruby server.rb
druby://mal.lan:51630

$ ruby client.rb druby://mal.lan:51630
[[0, 100035], ...]

[[45, 42], [45, 178902]]
Marc-André Lafortune
Thanks. I'm actually pretty far along with a Sinatra app that connects to the same database to load the data initially, then exposes a REST API to keep things up to date and to fetch common users when sent a user id. DRb seems pretty elegant from your example, so I think I'll give that a shot too.
Gdeglin
This worked great. Thanks again.
Gdeglin
+2  A: 

Maybe it's too obvious, but if you sacrifice a little access speed to the members of your hash, a traditional database will give you much more consistent access times to your values. You could start there and then add caching to see if you can get enough speed from it. That would be a little simpler than using Sinatra or some other tool.

ndp
Thanks for the response. The data actually is stored in a traditional database. I'm copying it all into a hash because it's just too slow to run the queries I need with SQL. You can get a better idea of what I'm using this for by looking at this question: http://stackoverflow.com/questions/2878429/algorithm-for-finding-similar-users-through-a-join-table
Gdeglin
A: 

What about storing the individual values in memcached instead of storing the whole hash in memcached? Using your code above:

# Cache is assumed to be a memcached client (e.g. memcache-client)
0.upto(500) do |r|
  0.upto(10_000) do |c|
    key = "#{r}:#{c}"
    value = rand(10) == 0 ? 1 : 0 # 10% chance of being 1
    Cache.set(key, value)
  end
end

This will be speedy, you won't have to worry about serialization, and all of your systems will have access to it. As I asked in a comment on the main post, you will have to get creative about accessing the data, but it should be easy to do.
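
For reading values back without one round trip per key, a multi-get helps; a sketch, assuming CACHE is a memcache-client instance:

keys = []
0.upto(500) { |r| 0.upto(10_000) { |c| keys << "#{r}:#{c}" } }

values = {}
keys.each_slice(1_000) do |batch|        # batch to keep each request small
  values.merge!(CACHE.get_multi(*batch)) # returns a hash of key => value
end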

Sixty4Bit
Thanks for the response. Unfortunately I need to be able to iterate through the data very quickly (basically read through the entire object in under 150 milliseconds). This is possible if I store it in a hash or an array, but definitely not possible if each value is stored separately in memcached.
Gdeglin