The Problem


I'm working on a problem that involves sharding. As part of it, I need to find the fastest way to partition a large Ruby hash (> 200,000 entries) into two or more pieces.

  • Are there any non-O(n) approaches?

  • Is there a non-Ruby (i.e. C/C++) implementation?

Please don't reply with examples using the trivial approach of converting the hash to an array and rebuilding N distinct hashes.

My concern is that Ruby is too slow to do this kind of work.


The initial approach


This was the first solution I tried. What was appealing about it was:

  • it didn't need to loop slavishly across the hash
  • it didn't need to manage a counter to allocate the members evenly among the shards
  • it's short and neat-looking

OK, it's still O(n), but it relies on methods in the standard library, which I figured would be faster than writing my own Ruby code.


# each_slice(pivot) yields the hash's [key, value] pairs in chunks of
# `pivot` pairs; Hash[] then rebuilds a hash from each flattened chunk.
pivot = s.size / 2
slices = s.each_slice(pivot)
s1 = Hash[*slices.entries[0].flatten]
s2 = Hash[*slices.entries[1].flatten]


A better solution

Mark and Mike were kind enough to suggest approaches. I have to admit that Mark's approach felt wrong - it did exactly what I didn't want: it looped over all of the members of the hash and evaluated a conditional as it went - but since he'd taken the time to do the evaluation, I figured I should try a similar approach and benchmark it. This is my adapted version of his approach (my keys aren't numbers, so I can't use his code verbatim).

# Split a hash into two shards of roughly equal size, timing the work.
# The elapsed (real) time and the element count are accumulated in the
# globals $b and $e so totals can be reported across many hashes.
def split_shard(s)
    shard1 = {}
    shard2 = {}

    t = Benchmark.measure do
        n = 0

        pivot = s.size / 2

        s.each_pair do |k, v|
            if n < pivot
                shard1[k] = v
            else
                shard2[k] = v
            end

            n += 1
        end
    end

    $b += t.real
    $e += s.size
    return shard1, shard2
end
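
For completeness, a minimal (hypothetical) driver sketch showing how the $b and $e accumulators are meant to be used - they need to be initialised once before the hashes are processed, and `hashes` stands in for the real data set:

require 'benchmark'

$b = 0.0    # total elapsed (real) time across all splits
$e = 0      # total number of elements split

hashes.each do |h|
    shard1, shard2 = split_shard(h)
    # ... hand shard1 and shard2 off to wherever they need to go
end

puts "split #{$e} elements in #{$b} seconds"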


The results


In both cases, a large number of hashes were split into shards. The total number of elements across all of the hashes in the test data set was 1,680,324.

My initial solution - which "had to be" faster because it uses methods in the standard library and minimises the amount of explicit Ruby code (no loop, no conditional) - runs in just over 9s.

Mark's approach runs in just over 5s.

That's a significant win.


Take away


Don't be fooled by 'intuition' - measure the performance of competing algorithms.

Don't worry too much about Ruby's performance as a language - my initial concern was that doing ten million of these operations could take a significant amount of time in Ruby, but in practice it doesn't.

Thanks to Mark and Mike who both get points from me for their help.

Thanks!

+4  A: 

I don't see how you can achieve this using an unmodified "vanilla" Hash - I'd expect that you'd need to get into the internals in order to make partitioning into some kind of bulk memory-copying operation. How good is your C?

I'd be more inclined to look into partitioning instead of creating a Hash in the first place, especially if the only reason for the 200K-item Hash existing in the first place is to be subdivided.
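
For illustration, a rough sketch of that idea - route each entry into its shard as it arrives from whatever source currently builds the big hash (the `source` enumerable and `shard_count` here are made-up placeholders):

shard_count = 2
shards = Array.new(shard_count) { {} }

source.each_pair do |key, value|                # hypothetical data source
  shards[key.hash % shard_count][key] = value   # route each entry by its key's hash
end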

EDIT: After thinking about it at the gym...

The problem with finding some existing solution is that someone else needs to have (a) experienced the pain, (b) had the technical ability to address it and (c) felt community-friendly enough to have released it into the wild. Oh, and for your OS platform.

What about using a B-Tree instead of a Hash? Hold your data sorted by key and it can be traversed by memcpy(). B-Tree retrieval is O(log N), which isn't much of a hit against Hash most of the time.
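
Not a real B-Tree, but a rough Ruby-level analogue of the keep-it-sorted idea: if the data is held as a key-sorted array of pairs instead of a Hash, the split itself is a single slice rather than an element-by-element walk (the O(n) work moves into building the sorted structure):

pairs = s.sort_by { |k, _| k }      # key-sorted array of [key, value] pairs
pivot = pairs.size / 2
shard1 = Hash[pairs[0...pivot]]     # Hash[] accepts an array of pairs (Ruby 1.9+)
shard2 = Hash[pairs[pivot..-1]]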

I found something here which might help, and I'd expect there'd only be a little duck-typing wrapper needed to make it quack like a Hash.

Still gonna need those C/C++ skills, though. (Mine are hopelessly rusty).

Mike Woodhouse
Pretty good - 15 years of it, but if it's already been done ...
Chris McCauley
Thanks Mike but the link is broken. Did you mean the r/b-tree implementation?
Chris McCauley
Um. Yes. I changed the link. Wasn't necessarily suggesting using it directly but perhaps use it for ideas.
Mike Woodhouse
A: 

This probably isn't fast enough for your needs (which sound like they'll require an extension in C), but perhaps you could use Hash#select?
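
A minimal sketch of what the Hash#select idea might look like; `belongs_in_first_shard?` is a hypothetical predicate on the key, and note that in Ruby 1.9 both select and reject return Hashes, while in 1.8 select returns an array of pairs:

s1 = s.select { |k, v| belongs_in_first_shard?(k) }   # entries for the first shard
s2 = s.reject { |k, v| belongs_in_first_shard?(k) }   # everything else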

I agree with Mike Woodhouse's idea. Is it possible for you to construct your shards at the same place where the original 200k-item hash is being constructed? If the items are coming out of a database, you could split your query into multiple disjoint queries, based either on some aspect of the key or by repeatedly using something like LIMIT 10000 to grab a chunk at a time.
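
A hedged sketch of the chunked-query idea, where `db.fetch_pairs`, the table, and the column names are all hypothetical stand-ins for whatever data access layer is actually in use:

shards = []
offset = 0
loop do
  # fetch_pairs is assumed to return [key, value] rows for the chunk
  rows = db.fetch_pairs("SELECT key, value FROM items ORDER BY key LIMIT 10000 OFFSET #{offset}")
  break if rows.empty?
  shard = {}
  rows.each { |k, v| shard[k] = v }
  shards << shard
  offset += 10_000
end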

Additional

Hi Chris, I just compared your each_slice approach with a straightforward single-pass loop (the same idea as the Hash#select suggestion above):

require 'benchmark'

s = {}
1.upto(200_000) { |i| s[i] = i}

Benchmark.bm do |x|
  x.report {
    pivot = s.size / 2
    slices = s.each_slice(pivot)
    s1 = Hash[*slices.entries[0].flatten]
    s2 = Hash[*slices.entries[1].flatten]
  }
  x.report {
    s1 = {}
    s2 = {}
    s.each_pair do |k,v|
      if k < 100_001
        s1[k] = v
      else
        s2[k] = v
      end
    end
  }
end

It looks like the straightforward loop is much faster, even though it still has to walk the entire large hash:

# ruby test.rb 
      user     system      total        real
  0.560000   0.010000   0.570000 (  0.571401)
  0.320000   0.000000   0.320000 (  0.323099)

Hope this helps.

Mark Westling
Thanks, they aren't coming from a database and unfortunately other parts of the system want to treat the data as a hash. You can guess from the question that I was hoping that a C extension would be available as it's probably the only way to meet all the needs.
Chris McCauley
One other idea -- if you're sharding, you're probably sending the sub-hashes off to other servers, right? If so, is it possible for you to work on the stringified version? Let's say you're using JSON. Could you create a JSON string for the original hash and then split the string? You should be able to inspect the characters in the middle of the string, see if you're on a key-value descriptor, then move to the next key-value boundary. This will be a huge string, but if you're using multiple servers, it's a step you might need to do anyway.
Mark Westling
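
For illustration, a rough sketch of the string-splitting idea above. The `split_json_hash` helper is hypothetical and assumes a flat hash of simple string/number values: it scans the JSON text once, tracking string and nesting state, and cuts at the first top-level comma past the midpoint:

require 'json'

def split_json_hash(json)
  mid = json.length / 2
  depth = 0
  in_string = false
  escaped = false
  cut = nil

  json.each_char.with_index do |c, i|
    if in_string
      if escaped
        escaped = false
      elsif c == '\\'
        escaped = true
      elsif c == '"'
        in_string = false
      end
    elsif c == '"'
      in_string = true
    elsif c == '{' || c == '['
      depth += 1
    elsif c == '}' || c == ']'
      depth -= 1
    elsif c == ',' && depth == 1 && i >= mid
      cut = i
      break
    end
  end

  return [json, nil] if cut.nil?                      # nothing to split
  ["#{json[0...cut]}}", "{#{json[(cut + 1)..-1]}"]    # two self-contained JSON objects
end

# Each half can then be parsed back with JSON.parse on the receiving server.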
I hadn't thought of that - it might be a less painful way to deal with the entire hash as a block of memory.
Mike Woodhouse
That is a very good idea. Definitely worth trying
Chris McCauley
Wow, thanks Mark. I'm going to check your benchmarks with my data (the value part of each key-value pair is a shortish string) and see if I get the same boost. You know, I'm fast coming to the conclusion that Ruby is fast enough at this.
Chris McCauley
Mark, I checked my version against yours: yours runs in just over 5s whereas mine took just over 9s, so the approach which intuitively shouldn't perform as well (because it needs to manage a counter and evaluate a conditional) actually performs significantly better. I'm marking this as the correct answer because it gives me a huge performance boost and because you actually went to the trouble of measuring the performance of your suggestion - thanks
Chris McCauley