I wrote a quick Python script to compare two files, each containing unordered hashes, in order to verify that both files are identical aside from order. Then I rewrote it in Ruby for educational purposes.

The Python implementation takes seconds, while the Ruby implementation takes around 4 minutes.

I have a feeling this is most likely due to my lack of Ruby knowledge. Any ideas on what I am doing wrong?

Environment: Windows XP x64, Python 2.6, Ruby 1.8.6.

Python

f = open('c:\\file1.txt', 'r')

hashes = dict()

for line in f.readlines():
    if line not in hashes:
        hashes[line] = 1
    else:
        hashes[line] += 1


print "Done file 1"

f.close()

f = open('c:\\file2.txt', 'r')

for line in f.readlines():
    if line not in hashes:
        print "Hash not found!"
    else:
        hashes[line] -= 1

f.close()

print "Done file 2"

num_errors = 0

for key in hashes.keys():
    if hashes[key] != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

print "Total of %d mismatches found" % num_errors

Ruby

file = File.open("c:\\file1.txt", "r")
hashes = {}

file.each_line { |line|
  if hashes.has_key?(line)
    hashes[line] += 1
  else
    hashes[line] = 1
  end
}

file.close()

puts "Done file 1"

file = File.open("c:\\file2.txt", "r")

file.each_line { |line|
  if hashes.has_key?(line)
    hashes[line] -= 1
  else
    puts "Hash not found!"
  end
}

file.close()

puts "Done file 2"

num_errors = 0
hashes.each_key{ |key|
  if hashes[key] != 0
    num_errors += 1
  end
}

puts "Total of #{num_errors} mismatches found"

EDIT: To give an idea of scale, the files are pretty big: over 900,000 hashes each.

PROGRESS

Using nathanvda's suggestions, here is the optimized Ruby script:

f1 = "c:\\file1.txt"
f2 = "c:\\file2.txt"

hashes = Hash.new(0)

File.open(f1, "r") do |f|
  while line = f.gets
    hashes[line] += 1
  end
end  

not_founds = 0

File.open(f2, "r") do |f|
  while line = f.gets
    if hashes.has_key?(line)
      hashes[line] -= 1
    else
      not_founds += 1
    end
  end
end

num_errors = hashes.values.select { |z| z != 0 }.size

puts "Total of #{not_founds} lines not found in file2"
puts "Total of #{num_errors} mismatches found"

On Windows with Ruby 1.8.7, the original version took 250 seconds and the optimized version took 223.

On a Linux VM running Ruby 1.9.1, the original version ran in 81 seconds, about a third of the Windows 1.8.7 time. Interestingly, the optimized version took longer at 89 seconds. Note that while line = ... was necessary due to memory constraints.

On Windows with Ruby 1.9.1, the original took 457 seconds and the optimized version took 543.

On Windows with JRuby, the original took 45 seconds and the optimized version took 43.

I am somewhat surprised by the results; I was expecting 1.9.1 to be an improvement over 1.8.7.

+5  A: 

It could be because dicts in Python are much faster than hashes in Ruby.

I've just run a quick test: building a hash of 12345678 items in Ruby 1.8.7 took 3 times as long as in Python, and Ruby 1.9 took about twice as long.

Here is how I tested:

Python

$ time python -c "d={}
for i in xrange(12345678):d[i]=1"

Ruby

$ time ruby -e "d={};12345678.times{|i|d[i]=1}"

Not enough to account for your discrepancy though.

Perhaps file I/O is worth looking into: comment out all the hash code and see how long the empty loops take to run over the files.
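
For example, a minimal I/O-only timing sketch in Ruby (assuming the same file path as in the question; Benchmark is in the standard library):

require 'benchmark'

# Time just the line-by-line read, with no hash work at all
puts Benchmark.measure {
  File.open("c:\\file1.txt", "r") { |f| f.each_line { |line| } }
}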

Here's another version in Python using defaultdict and context managers:

from collections import defaultdict
hashes = defaultdict(int)

with open('c:\\file1.txt', 'r') as f:
    for line in f:
        hashes[line] += 1

print "Done file 1"

with open('c:\\file2.txt', 'r') as f:
    for line in f:
        if line in hashes:
            hashes[line] -= 1
        else:
            print "Hash not found!"

print "Done file 2"

num_errors = 0
for key,value in hashes.items():  # hashes.iteritems() might be better here
    if value != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

print "Total of %d mismatches found" % num_errors
gnibbler
I believe you mean `for key, value in hashes.items()` (or `hashes.iteritems()`) rather than `hashes.keys()`.
Chris Lutz
Good catch, thanks Chris
gnibbler
In his case, he is hashing strings. Changing your Python test to use d[str(i)]=1 gives me a fourfold increase in time. I don't know enough Ruby to change the Ruby example and compare.
Mark Peters
@Mark, it would be `i.to_s` for ruby
gnibbler
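
In other words, the string-key variant of the Ruby one-liner above would be (a sketch, same count as before):

$ time ruby -e "d={};12345678.times{|i|d[i.to_s]=1}"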
@gnibbler, my times: Python (int): 0m4.359s; Python (str): 0m16.202s; Ruby (int): 0m8.092s; Ruby (str): 0m36.648s. Seems like they scale about the same.
Mark Peters
+2  A: 

On the python side, you could iterate over the dictionary items like this:

for key, value in hashes.iteritems():
    if value != 0:
        print "Uneven hash count: %s" % key
        num_errors += 1

Also:

for line in f.readlines():
    hashes[line] = hashes.setdefault(line, 0) + 1

... but I can't help you with the Ruby side, other than to suggest you hunt down a profiler.

retracile
If the dictionary has a lot of elements, you should use `hashes.iteritems()` instead.
Chris Lutz
@Chris: Quite right; answer updated accordingly.
retracile
+4  A: 

I've found Ruby's reference implementation (well, Ruby) to be (unscientifically stated) dog slow.

If you have the opportunity, please try running your program under JRuby! Charles Nutter and other Sun folks claim to have sped Ruby up dramatically.

I for one would be most interested in your results.

Carl Smotricz
Ruby 1.8.6 in particular was not very fast. And the version of Python in question is quite a bit newer. Ruby 1.9 would surely be better.
Chuck
I tried JRuby, and while it was MUCH faster than the 1.8.6 interpreter, at ~45 seconds it was still 3x slower than Python (~15s)
t_scho
Interesting. Thank you VERY much!
Carl Smotricz
A: 

I'm not a Ruby expert, so someone please correct me if I'm wrong:

I see a small optimization potential.

If you say

hashes = Hash.new(0)

then a reference to a missing key will return 0 (it is the assignment hidden in += that actually stores the key), and you can do

hashes[line] += 1

every time, without the enclosing if and else.

Caveat: Untested!

If storing the key doesn't happen automatically, there's yet another hash constructor using a block where you can do it explicitly.
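
To illustrate, a quick sketch of both constructor forms (hypothetical keys; behavior as described above):

counts = Hash.new(0)                   # missing keys read as 0; the read alone stores nothing
counts["abc"] += 1                     # the assignment inside += is what stores the key

counts = Hash.new { |h, k| h[k] = 0 }  # block form: first access stores the default explicitly
counts["xyz"]                          # => 0, and "xyz" is now a key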

Carl Smotricz
Similarly, `collections.defaultdict(lambda : 0)` for Python. http://docs.python.org/library/collections.html#defaultdict-objects
Craig McQueen
I believe that it is more typically written `collections.defaultdict(int)`.
ephemient
A: 

Python's dictionary is brutally fast. See http://stackoverflow.com/questions/327311/how-are-pythons-built-in-dictionaries-implemented Perhaps Ruby's is not so crash hot.

I doubt it's the hash functions. There's no way the Ruby devs would have a hash function that's an order of magnitude worse than Python's.

Perhaps Ruby 1.8 is slow at dynamically resizing large hash tables? How does your problem scale with smaller files?
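
One way to probe that (a rough sketch; the sizes are arbitrary) is to time hash builds at increasing sizes and check whether growth stays roughly linear:

require 'benchmark'

[100_000, 300_000, 900_000].each do |n|
  t = Benchmark.realtime { h = {}; n.times { |i| h[i.to_s] = 1 } }
  puts "#{n} keys: #{t}s"
end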

wisty
In terms of scale, the difference between inputs of 1,000 and 10,000 is minimal. The difference between 10,000 and 100,000 is nearly linear. The difference between 100,000 and nearly 1,000,000 is massive. This suggests to me that it is scaling much worse than linearly.
t_scho
A: 

I was able to speed up your ruby code a bit as follows:

require 'benchmark'

Benchmark.bm(10) do |x|

  x.report("original version") do
    file = File.open("c:\\file1.txt", "r")
    hashes = {}

    file.each_line { |line|
      if hashes.has_key?(line)
        hashes[line] += 1
      else
        hashes[line] = 1
      end
    }

    file.close()

    #puts "Done file 1"

    not_founds = 0

    file = File.open("c:\\file2.txt", "r")

    file.each_line { |line|
      if hashes.has_key?(line)
        hashes[line] -= 1
      else
        not_founds += 1        
      end
    }

    file.close()

    #puts "Done file 2"

    num_errors = 0
    hashes.each_key{ |key|
      if hashes[key] != 0
        num_errors += 1
      end
    }

    puts "Total of #{not_founds} lines not found in file2"
    puts "Total of #{num_errors} mismatches found"

  end


  x.report("my speedup") do
    hashes = {}
    File.open("c:\\file1.txt", "r") do |f|
      lines = f.readlines
      lines.each { |line|
        if hashes.has_key?(line)
          hashes[line] += 1
        else
          hashes[line] = 1
        end
      }
    end  

    not_founds = 0

    File.open("c:\\file2.txt", "r") do |f|
      lines = f.readlines
      lines.each { |line|
        if hashes.has_key?(line)
          hashes[line] -= 1
        else
          not_founds += 1
        end
      }
    end

    num_errors = hashes.values.select { |z| z != 0 }.size

    puts "Total of #{not_founds} lines not found in file2"
    puts "Total of #{num_errors} mismatches found"

  end

end

So I read the files in one big chunk; in my case this is a bit faster (I tested on Windows XP, Ruby 1.8.6 and a file of 100,000 lines). I benchmarked all the different ways to read files I could think of, and this was the fastest. I also sped up the counting of the values in the hash a bit, but that is only visible for very large numbers :)

So I get a very small speed increase here. The output on my machine is as follows:

                user     system      total        real
original versionTotal of 16 lines not found in file2
Total of 4 mismatches found
   1.000000   0.015000   1.015000 (  1.016000)
my speedup v1Total of 16 lines not found in file2
Total of 4 mismatches found
   0.812000   0.047000   0.859000 (  0.859000)

Does anyone have any ideas to improve this further?

If f.readlines gets slower because of the file size, I found that

File.open("c:\\file2.txt", "r") do |f|
  while (line=f.gets)
    if hashes.has_key?(line)
      hashes[line] -= 1
    else
      not_founds += 1
    end
  end
end

is just a tad quicker for me.

I was thinking about a way to improve the

if hashes.has_key?(line) ...

code a bit, but could not think of anything.
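
One small possibility (a sketch, untested; it assumes the plain {} hash above, where a missing key reads as nil) is to do a single lookup instead of has_key? plus an index:

count = hashes[line]   # one lookup; nil means the key was never seen
if count               # note: 0 is truthy in Ruby, so a zero count still passes
  hashes[line] = count - 1
else
  not_founds += 1
end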

Have you tried using Ruby 1.9?

I have a Windows 7 virtual machine with Ruby 1.9.1, and there f.readlines was slower, and I needed to use while (line=f.gets) because of the memory limitations :)

Since a lot of Ruby users test mainly on Unix-related platforms, I guess that could explain why the code is sub-optimal on Windows. Has anybody compared the above performance on Unix? Is this a Ruby vs. Python problem, or Ruby-on-Windows vs. Ruby-on-Unix?

nathanvda