The Problem


I'm working on a problem that involves sharding. As part of it, I need to find the fastest way to partition a large Ruby hash (> 200,000 entries) into two or more pieces.

  • Are there any non-O(n) approaches?

  • Is there a non-Ruby (i.e. C/C++) implementation?

Please don't reply with examples using the trivial approach of converting the hash to an array and rebuilding N distinct hashes.

My concern is that Ruby is too slow to do this kind of work.


The initial approach


This was the first solution I tried. What was appealing about it was:

  • it didn't need to loop slavishly across the hash
  • it didn't need to manage a counter to allocate the members evenly among the shards
  • it's short and neat-looking

OK, it's still O(n), but it relies on methods in the standard library, which I figured would be faster than writing my own Ruby code.


# each_slice(pivot) yields the hash's [key, value] pairs in chunks of
# `pivot` pairs; Hash[] then rebuilds a hash from each flattened chunk.
pivot = s.size / 2
slices = s.each_slice(pivot)
s1 = Hash[*slices.entries[0].flatten]
s2 = Hash[*slices.entries[1].flatten]


A better solution

Mark and Mike were kind enough to suggest approaches. I have to admit that Mark's approach felt wrong - it did exactly what I didn't want: it looped over all of the members of the hash and evaluated a conditional as it went - but since he'd taken the time to do the evaluation, I figured I should try a similar approach and benchmark it. This is my adapted version of his approach (my keys aren't numbers, so I can't use his code verbatim).

# Split a hash into two shards of roughly equal size, timing the work.
# The elapsed (real) time and the element count are accumulated in the
# globals $b and $e so totals can be reported across many hashes.
def split_shard(s)
    shard1 = {}
    shard2 = {}

    t = Benchmark.measure do
        n = 0

        pivot = s.size / 2

        s.each_pair do |k, v|
            if n < pivot
                shard1[k] = v
            else
                shard2[k] = v
            end

            n += 1
        end
    end

    $b += t.real
    $e += s.size
    return shard1, shard2
end
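
For completeness, a minimal (hypothetical) driver sketch showing how the $b and $e accumulators are meant to be used - they need to be initialised once before the hashes are processed, and `hashes` stands in for the real data set:

require 'benchmark'

$b = 0.0    # total elapsed (real) time across all splits
$e = 0      # total number of elements split

hashes.each do |h|
    shard1, shard2 = split_shard(h)
    # ... hand shard1 and shard2 off to wherever they need to go
end

puts "split #{$e} elements in #{$b} seconds"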


The results


In both cases, a large number of hashes were split into shards. The total number of elements across all of the hashes in the test data set was 1,680,324.

My initial solution - which "had to be" faster because it uses methods in the standard library and minimises the amount of explicit Ruby code (no loop, no conditional) - runs in just over 9s.

Mark's approach runs in just over 5s.

That's a significant win.


Take away


Don't be fooled by 'intuition' - measure the performance of competing algorithms.

Don't worry too much about Ruby's performance as a language - my initial concern was that doing ten million of these operations could take a significant amount of time in Ruby, but in practice it doesn't.

Thanks to Mark and Mike who both get points from me for their help.

Thanks!

+4  A: 

I don't see how you can achieve this using an unmodified "vanilla" Hash - I'd expect that you'd need to get into the internals in order to make partitioning into some kind of bulk memory-copying operation. How good is your C?

I'd be more inclined to look into partitioning instead of creating a Hash in the first place, especially if the only reason for the 200K-item Hash existing in the first place is to be subdivided.
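
For illustration, a rough sketch of that idea - route each entry into its shard as it arrives from whatever source currently builds the big hash (the `source` enumerable and `shard_count` here are made-up placeholders):

shard_count = 2
shards = Array.new(shard_count) { {} }

source.each_pair do |key, value|                # hypothetical data source
  shards[key.hash % shard_count][key] = value   # route each entry by its key's hash
end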

EDIT: After thinking about it at the gym...

The problem with finding some existing solution is that someone else needs to have (a) experienced the pain, (b) had the technical ability to address it and (c) felt community-friendly enough to have released it into the wild. Oh, and for your OS platform.

What about using a B-Tree instead of a Hash? Hold your data sorted by key and it can be traversed by memcpy(). B-Tree retrieval is O(log N), which isn't much of a hit against Hash most of the time.
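
Not a real B-Tree, but a rough Ruby-level analogue of the keep-it-sorted idea: if the data is held as a key-sorted array of pairs instead of a Hash, the split itself is a single slice rather than an element-by-element walk (the O(n) work moves into building the sorted structure):

pairs = s.sort_by { |k, _| k }      # key-sorted array of [key, value] pairs
pivot = pairs.size / 2
shard1 = Hash[pairs[0...pivot]]     # Hash[] accepts an array of pairs (Ruby 1.9+)
shard2 = Hash[pairs[pivot..-1]]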

I found something here which might help, and I'd expect there'd only be a little duck-typing wrapper needed to make it quack like a Hash.

Still gonna need those C/C++ skills, though. (Mine are hopelessly rusty).

Mike Woodhouse
Pretty good - 15 years of it, but if it's already been done ...
Chris McCauley
Thanks Mike but the link is broken. Did you mean the r/b-tree implementation?
Chris McCauley
Um. Yes. I changed the link. Wasn't necessarily suggesting using it directly but perhaps use it for ideas.
Mike Woodhouse
A: 

This probably isn't fast enough for your needs (which sound like they'll require an extension in C), but perhaps you could use Hash#select?
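
A minimal sketch of what the Hash#select idea might look like; `belongs_in_first_shard?` is a hypothetical predicate on the key, and note that in Ruby 1.9 both select and reject return Hashes, while in 1.8 select returns an array of pairs:

s1 = s.select { |k, v| belongs_in_first_shard?(k) }   # entries for the first shard
s2 = s.reject { |k, v| belongs_in_first_shard?(k) }   # everything else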

I agree with Mike Woodhouse's idea. Is it possible for you to construct your shards at the same place where the original 200k-item hash is being constructed? If the items are coming out of a database, you could split your query into multiple disjoint queries, based either on some aspect of the key or by repeatedly using something like LIMIT 10000 to grab a chunk at a time.
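
A hedged sketch of the chunked-query idea, where `db.fetch_pairs`, the table, and the column names are all hypothetical stand-ins for whatever data access layer is actually in use:

shards = []
offset = 0
loop do
  # fetch_pairs is assumed to return [key, value] rows for the chunk
  rows = db.fetch_pairs("SELECT key, value FROM items ORDER BY key LIMIT 10000 OFFSET #{offset}")
  break if rows.empty?
  shard = {}
  rows.each { |k, v| shard[k] = v }
  shards << shard
  offset += 10_000
end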

Additional

Hi Chris, I just compared your each_slice approach with a straightforward single-pass loop (the same idea as the Hash#select suggestion above):

require 'benchmark'

s = {}
1.upto(200_000) { |i| s[i] = i}

Benchmark.bm do |x|
  x.report {
    pivot = s.size / 2
    slices = s.each_slice(pivot)
    s1 = Hash[*slices.entries[0].flatten]
    s2 = Hash[*slices.entries[1].flatten]
  }
  x.report {
    s1 = {}
    s2 = {}
    s.each_pair do |k,v|
      if k < 100_001
        s1[k] = v
      else
        s2[k] = v
      end
    end
  }
end

It looks like the straightforward loop is much faster, even though it still has to walk the entire large hash:

# ruby test.rb 
      user     system      total        real
  0.560000   0.010000   0.570000 (  0.571401)
  0.320000   0.000000   0.320000 (  0.323099)

Hope this helps.

Mark Westling
Thanks, they aren't coming from a database and unfortunately other parts of the system want to treat the data as a hash. You can guess from the question that I was hoping that a C extension would be available as it's probably the only way to meet all the needs.
Chris McCauley
One other idea -- if you're sharding, you're probably sending the sub-hashes off to other servers, right? If so, is it possible for you to work on the stringified version? Let's say you're using JSON. Could you create a JSON string for the original hash and then split the string? You should be able to inspect the characters in the middle of the string, see if you're on a key-value descriptor, then move to the next key-value boundary. This will be a huge string, but if you're using multiple servers, it's a step you might need to do anyway.
Mark Westling
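
For illustration, a rough sketch of the string-splitting idea above. The `split_json_hash` helper is hypothetical and assumes a flat hash of simple string/number values: it scans the JSON text once, tracking string and nesting state, and cuts at the first top-level comma past the midpoint:

require 'json'

def split_json_hash(json)
  mid = json.length / 2
  depth = 0
  in_string = false
  escaped = false
  cut = nil

  json.each_char.with_index do |c, i|
    if in_string
      if escaped
        escaped = false
      elsif c == '\\'
        escaped = true
      elsif c == '"'
        in_string = false
      end
    elsif c == '"'
      in_string = true
    elsif c == '{' || c == '['
      depth += 1
    elsif c == '}' || c == ']'
      depth -= 1
    elsif c == ',' && depth == 1 && i >= mid
      cut = i
      break
    end
  end

  return [json, nil] if cut.nil?                      # nothing to split
  ["#{json[0...cut]}}", "{#{json[(cut + 1)..-1]}"]    # two self-contained JSON objects
end

# Each half can then be parsed back with JSON.parse on the receiving server.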
I hadn't thought of that - it might be a less painful way to deal with the entire hash as a block of memory.
Mike Woodhouse
That is a very good idea. Definitely worth trying
Chris McCauley
Wow, thanks Mark. I'm going to check your benchmarks with my data (the value part of each key-value pair is a shortish string) and see if I get the same boost. You know, I'm fast coming to the conclusion that Ruby is fast enough at this.
Chris McCauley
Mark, I checked my version against yours: yours runs in just over 5s whereas mine took just over 9s, so the approach which intuitively shouldn't perform as well (because it needs to manage a counter and evaluate a conditional) actually performs significantly better. I'm marking this as the correct answer because it gives me a huge performance boost and because you actually went to the trouble of measuring the performance of your suggestion - thanks
Chris McCauley