I wrote a quick Python script to compare two files, each containing unordered hashes, in order to verify that both files are identical aside from order. Then I rewrote it in Ruby for educational purposes.

The Python implementation takes seconds, while the Ruby implementation takes around 4 minutes.

I have a feeling this is most likely due to my lack of Ruby knowledge. Any ideas on what I am doing wrong?

Environment: Windows XP x64, Python 2.6, Ruby 1.8.6.

Python

f = open('c:\\file1.txt', 'r')

hashes = dict()

for line in f.readlines():
    if line not in hashes:
        hashes[line] = 1
    else:
        hashes[line] += 1


print "Done file 1"

f.close()

f = open('c:\\file2.txt', 'r')

for line in f.readlines():
    if line not in hashes:
        print "Hash not found!"
    else:
        hashes[line] -= 1

f.close()

print "Done file 2"

num_errors = 0

for key in hashes.keys():
    if hashes[key] != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

print "Total of %d mismatches found" % num_errors

Ruby

file = File.open("c:\\file1.txt", "r")
hashes = {}

file.each_line { |line|
  if hashes.has_key?(line)
    hashes[line] += 1
  else
    hashes[line] = 1
  end
}

file.close()

puts "Done file 1"

file = File.open("c:\\file2.txt", "r")

file.each_line { |line|
  if hashes.has_key?(line)
    hashes[line] -= 1
  else
    puts "Hash not found!"
  end
}

file.close()

puts "Done file 2"

num_errors = 0
hashes.each_key{ |key|
  if hashes[key] != 0
    num_errors += 1
  end
}

puts "Total of #{num_errors} mismatches found"

EDIT: To give an idea of scale, the files are pretty big: over 900,000 hashes each.

PROGRESS

Using nathanvda's suggestions, here is the optimized Ruby script:

f1 = "c:\\file1.txt"
f2 = "c:\\file2.txt"

hashes = Hash.new(0)

File.open(f1, "r") do |f|
  while line = f.gets
    hashes[line] += 1
  end
end  

not_founds = 0

File.open(f2, "r") do |f|
  while line = f.gets
    if hashes.has_key?(line)
      hashes[line] -= 1
    else
      not_founds += 1
    end
  end
end

num_errors = hashes.values.select { |z| z != 0 }.size

puts "Total of #{not_founds} lines not found in file2"
puts "Total of #{num_errors} mismatches found"

On Windows with Ruby 1.8.7, the original version took 250 seconds and the optimized version took 223.

On a Linux VM running Ruby 1.9.1, the original version ran in 81 seconds, about a third of the Windows 1.8.7 time. Interestingly, the optimized version took longer at 89 seconds. Note that while line = ... was necessary due to memory constraints.

On Windows with Ruby 1.9.1, the original took 457 seconds and the optimized version took 543.

On Windows with JRuby, the original took 45 seconds and the optimized version took 43.

I am somewhat surprised by the results; I was expecting 1.9.1 to be an improvement over 1.8.7.

+5  A: 

It could be because dicts in Python are much faster than hashes in Ruby.

I've just run a quick test: building a hash of 12345678 items in Ruby 1.8.7 took 3 times as long as in Python, and Ruby 1.9 took about twice as long.

Here is how I tested:

Python

$ time python -c "d={}
for i in xrange(12345678):d[i]=1"

Ruby

$ time ruby -e "d={};12345678.times{|i|d[i]=1}"

Not enough to account for your discrepancy though.

Perhaps file I/O is worth looking into: comment out all the hash code and see how long the empty loops take to run over the files.
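
For example, a minimal I/O-only timing sketch in Ruby (assuming the same file path as in the question; Benchmark is in the standard library):

require 'benchmark'

# Time just the line-by-line read, with no hash work at all
puts Benchmark.measure {
  File.open("c:\\file1.txt", "r") { |f| f.each_line { |line| } }
}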

Here's another version in Python using defaultdict and context managers:

from collections import defaultdict
hashes = defaultdict(int)

with open('c:\\file1.txt', 'r') as f:
    for line in f:
        hashes[line] += 1

print "Done file 1"

with open('c:\\file2.txt', 'r') as f:
    for line in f:
        if line in hashes:
            hashes[line] -= 1
        else:
            print "Hash not found!"

print "Done file 2"

num_errors = 0
for key,value in hashes.items():  # hashes.iteritems() might be better here
    if value != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

print "Total of %d mismatches found" % num_errors
gnibbler
I believe you mean `for key, value in hashes.items()` (or `hashes.iteritems()`) rather than `hashes.keys()`.
Chris Lutz
Good catch, thanks Chris
gnibbler
In his case, he is hashing strings. Changing your Python test to use d[str(i)]=1 gives me a fourfold increase in time. I don't know enough Ruby to change the Ruby example and compare.
Mark Peters
@Mark, it would be `i.to_s` for ruby
gnibbler
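
In other words, the string-key variant of the Ruby one-liner above would be (a sketch, same count as before):

$ time ruby -e "d={};12345678.times{|i|d[i.to_s]=1}"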
@gnibbler, my times: Python (int): 0m4.359s; Python (str): 0m16.202s; Ruby (int): 0m8.092s; Ruby (str): 0m36.648s. Seems like they scale about the same.
Mark Peters
+2  A: 

On the python side, you could iterate over the dictionary items like this:

for key, value in hashes.iteritems():
    if value != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

Also:

for line in f.readlines():
    hashes[line] = hashes.setdefault(line, 0) + 1

... but I can't help you with the Ruby side, other than to suggest you hunt down a profiler.

retracile
If the dictionary has a lot of elements, you should use `hashes.iteritems()` instead.
Chris Lutz
@Chris: Quite right; answer updated accordingly.
retracile
+4  A: 

I've found Ruby's reference implementation (well, Ruby) to be (unscientifically stated) dog slow.

If you have the opportunity, please try running your program under JRuby! Charles Nutter and other Sun folks claim to have sped Ruby up dramatically.

I for one would be most interested in your results.

Carl Smotricz
Ruby 1.8.6 in particular was not very fast. And the version of Python in question is quite a bit newer. Ruby 1.9 would surely be better.
Chuck
I tried JRuby, and while it was MUCH faster than the 1.8.6 interpreter, at ~45 seconds it was still 3x slower than Python (~15s)
t_scho
Interesting. Thank you VERY much!
Carl Smotricz
A: 

I'm not a Ruby expert, so someone please correct me if I'm wrong:

I see a small optimization potential.

If you say

hashes = Hash.new(0)

then a reference to a missing key will return 0 (it is the assignment hidden in += that actually stores the key), and you can do

hashes[line] += 1

every time, without the enclosing if and else.

Caveat: Untested!

If storing the key doesn't happen automatically, there's yet another hash constructor using a block where you can do it explicitly.
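
To illustrate, a quick sketch of both constructor forms (hypothetical keys; behavior as described above):

counts = Hash.new(0)                   # missing keys read as 0; the read alone stores nothing
counts["abc"] += 1                     # the assignment inside += is what stores the key

counts = Hash.new { |h, k| h[k] = 0 }  # block form: first access stores the default explicitly
counts["xyz"]                          # => 0, and "xyz" is now a key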

Carl Smotricz
Similarly, `collections.defaultdict(lambda : 0)` for Python. http://docs.python.org/library/collections.html#defaultdict-objects
Craig McQueen
I believe that it is more typically written `collections.defaultdict(int)`.
ephemient
A: 

Python's dictionary is brutally fast. See http://stackoverflow.com/questions/327311/how-are-pythons-built-in-dictionaries-implemented Perhaps Ruby's is not so crash hot.

I doubt it's the hash functions. There's no way the Ruby devs would have a hash function that's an order of magnitude worse than Python's.

Perhaps Ruby 1.8 is slow at dynamically resizing large hash tables? How does your problem scale with smaller files?
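
One way to probe that (a rough sketch; the sizes are arbitrary) is to time hash builds at increasing sizes and check whether growth stays roughly linear:

require 'benchmark'

[100_000, 300_000, 900_000].each do |n|
  t = Benchmark.realtime { h = {}; n.times { |i| h[i.to_s] = 1 } }
  puts "#{n} keys: #{t}s"
end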

wisty
In terms of scale, the difference between inputs of 1,000 and 10,000 is minimal. The difference between 10,000 and 100,000 is nearly linear. The difference between 100,000 and nearly 1,000,000 is massive. This suggests to me that it is scaling much worse than linearly.
t_scho
A: 

I was able to speed up your ruby code a bit as follows:

require 'benchmark'

Benchmark.bm(10) do |x|

  x.report("original version") do
    file = File.open("c:\\file1.txt", "r")
    hashes = {}

    file.each_line { |line|
      if hashes.has_key?(line)
        hashes[line] += 1
      else
        hashes[line] = 1
      end
    }

    file.close()

    #puts "Done file 1"

    not_founds = 0

    file = File.open("c:\\file2.txt", "r")

    file.each_line { |line|
      if hashes.has_key?(line)
        hashes[line] -= 1
      else
        not_founds += 1        
      end
    }

    file.close()

    #puts "Done file 2"

    num_errors = 0
    hashes.each_key{ |key|
      if hashes[key] != 0
        num_errors += 1
      end
    }

    puts "Total of #{not_founds} lines not found in file2"
    puts "Total of #{num_errors} mismatches found"

  end


  x.report("my speedup") do
    hashes = {}
    File.open("c:\\file1.txt", "r") do |f|
      lines = f.readlines
      lines.each { |line|
        if hashes.has_key?(line)
          hashes[line] += 1
        else
          hashes[line] = 1
        end
      }
    end  

    not_founds = 0

    File.open("c:\\file2.txt", "r") do |f|
      lines = f.readlines
      lines.each { |line|
        if hashes.has_key?(line)
          hashes[line] -= 1
        else
          not_founds += 1
        end
      }
    end

    num_errors = hashes.values.select { |z| z != 0 }.size

    puts "Total of #{not_founds} lines not found in file2"
    puts "Total of #{num_errors} mismatches found"

  end

end

So I read the files in one big chunk; in my case this is a bit faster (I tested on Windows XP, Ruby 1.8.6 and a file of 100,000 lines). I benchmarked all the different ways to read files I could think of, and this was the fastest. I also sped up the counting of the values in the hash a bit, but that is only visible for very large numbers :)

So I get a very small speed increase here. The output on my machine is as follows:

                user     system      total        real
original versionTotal of 16 lines not found in file2
Total of 4 mismatches found
   1.000000   0.015000   1.015000 (  1.016000)
my speedup v1Total of 16 lines not found in file2
Total of 4 mismatches found
   0.812000   0.047000   0.859000 (  0.859000)

Does anyone have any ideas to improve this further?

If f.readlines gets slower because of the file size, I found that

File.open("c:\\file2.txt", "r") do |f|
  while (line=f.gets)
    if hashes.has_key?(line)
      hashes[line] -= 1
    else
      not_founds += 1
    end
  end
end

is just a tad quicker for me.

I was thinking about a way to improve the

if hashes.has_key?(line) ...

code a bit, but could not think of anything.
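
One small possibility (a sketch, untested; it assumes the plain {} hash above, where a missing key reads as nil) is to do a single lookup instead of has_key? plus an index:

count = hashes[line]   # one lookup; nil means the key was never seen
if count               # note: 0 is truthy in Ruby, so a zero count still passes
  hashes[line] = count - 1
else
  not_founds += 1
end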

Have you tried using Ruby 1.9?

I have a Windows 7 virtual machine with Ruby 1.9.1, and there f.readlines was slower, and I needed to use while (line=f.gets) because of the memory limitations :)

Since a lot of Ruby users test mainly on Unix-related platforms, I guess that could explain why the code is sub-optimal on Windows. Has anybody compared the above performance on Unix? Is this a Ruby vs. Python problem, or Ruby-on-Windows vs. Ruby-on-Unix?

nathanvda