views:

745

answers:

4

I have a binary number (52 bits) represented as a string "01100011...."

What would be the quickest way to count the number of 1's?

"01100011....".count("1")

obviously works but is quite time consuming if this operation needs to be done thousands of times.

ok, some more info. I am trying to create bit vectors for words as follows

def bit_vec(str)
    alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'
    bv = ""
    alphabet.each_char do |a|
     if str.include?(a)
      bv += "1"
     else
      bv += "0"
     end
    end
        bv
end

The bit_vec method gets called about 170K times. I store the bit vectors in a hash and use them to find similar words for a given word by XOR'ing the bit vectors and counting the number of 1's (more 1's == less similarity). If the count method does not use String#scan the what else could use it?

I know Ruby is slower than say C or Java. I am just looking to improve the algorithm the best I can. I am not looking for raw speed.

Maybe the include? method is the bottleneck?

+8  A: 

You are going to have O(n) performance, no matter what. Try this simple ruby command. Measure if it's really a problem.

This simple script, measured with time ruby test.rb took 0.058 CPU seconds. This is on an old 1.25 Ghz processor. Are you really sure this operation is too slow?

10000.times do
  "0100010000100001111101101000111110000001101001010".count("1")
end

If that isn't fast enough write a C extension. Try to avoid using conditionals. Write it like this:

count = 0;
for (i = stringLength; i; i++) {
    count += string[i] - '0';     // no conditional used.
}

But honestly, if you need that kind of speed ruby is the wrong language for you. There are so many different things in ruby that take much more time than a simple .count("1").

Georg
I ran my program with rprofile and it showed that 38% of the time is being spent using String#scan (170K calls). I thought that if there was a clever way to count the 1's I might be able to speed things up a little.
Maulin
I don't think `.count()` uses `String#scan` internally. Maybe your bottleneck is somewhere else.
Georg
How can I find out what methods use String#scan internally?
Maulin
Use a profiler that tells you or look at the source. http://www.ruby-lang.org/en/downloads/
Georg
+2  A: 

from http://www.bergek.com/2009/03/11/count-number-of-bits-in-a-ruby-integer/

yourString.scan(/1/).size


from http://snippets.dzone.com/posts/show/4233

count = 0
count += byte & 1 and byte >>= 1 until byte == 0


Here is a post with different loops (in c) for counting based on the density of 0's vs. 1's

http://gurmeetsingh.wordpress.com/2008/08/05/fast-bit-counting-routines/

willoller
`scan(/1/).size` is more than 10 times slower than `count("1")`.
Georg
I benchmarked it and it's true. String#count is fast!
tadman
+1 for bitshift loop.
EmFi
+1  A: 

Split the string in 8, look up each entry in a 128 entry lookup table and sum them up?

I know.. this is ridiculous... just sharing some ideas ;-)

JRL
+1 - This may be the fastest approach, as ridiculous as it sounds. Even if they were grouped in groups of 4, it may be a bit slower but the lookup table would be so much smaller.
James Black
Hashing the 8 bytes is going to take longer than just counting the occurrences of 1s.
Georg
If you split by 8, wouldn't you need 256 entries in your lookup table?
Carl Norum
This will be slower than `count` in Ruby, almost guaranteed.
Bob Aman
+4  A: 

Note that the problem of counting 1-bits is referred to as a "population count".

At least in Ruby, stick with handling these as a string via the count method unless you have a compelling reason to use integers.

count:

Benchmark: 78.60s for 10,000,000 iterations (127,225.63 iterations per second)

For integer math,

If you don't care about values above 2**32,

def popcount(x)
  m1 = 0x55555555
  m2 = 0x33333333
  m4 = 0x0f0f0f0f
  x -= (x >> 1) & m1
  x = (x & m2) + ((x >> 2) & m2)
  x = (x + (x >> 4)) & m4
  x += x >> 8
  return (x + (x >> 16)) & 0x3f
end

Benchmark: 105.73s for 10,000,000 iterations (94,579.03 iterations per second)

If you do care about values above 2**32,

def popcount(x)
  b = 0
  while x > 0
    x &= x - 1
    b += 1
  end
  return b
end

Benchmark: 365.59s for 10,000,000 iterations (27,353.27 iterations per second)

Addenda:

Your code:

Benchmark: 78.25s for 1,000,000 iterations (12,779.56 iterations per second)

This code:

def bit_vec(str)
  # alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"
  bv = "0" * 52
  str.each_char do |c|
    ord = c.ord
    next unless (ord >= 65 && ord <= 90) || (ord >= 97 && ord <= 122)
    index = ord - 65
    index -= 6 if index > 25
    bv[index] = "1"
    break if bv == "1111111111111111111111111111111111111111111111111111"
  end
  bv
end

Note: You said that you were dealing with a 52-bit number, so I assumed that you cared about both upper and lower case letters (26 + 26 = 52). I opted to check for uppercase first because that's how they appear in pretty much every character set ever, making the calculations a little easier.

Benchmark: 24.86s for 1,000,000 iterations (40,231.60 iterations per second)

3.14x speed-up.

Bob Aman
Thanks for the detailed answer. That helped!
Maulin