I have written some Ruby code to import the Google n-gram data into a hash table, mapping word unigrams to their respective counts. I'm using symbols rather than strings for the keys. I've been running this code on a Linux box for a while now with no problems. Running it on my Mac this morning raised a "symbol table overflow" runtime error after loading about 2 million key-value pairs. I don't understand what is causing this error. Does anyone have suggestions on what the cause might be? I'm running Ruby 1.9.1 under OS X 10.5.8.

+1  A: 

Is the difference 64-bit vs. 32-bit Ruby? I suspect this because of your observation:

yielded a symbol table overflow runtime error after loading about 2 million key-value pairs

If this is the case, and strings are not an option because of your application design, the only thing you can do is use a native 64-bit build of Ruby. Otherwise you'll have to go with strings. Conversion is easy:

:symbol.to_s == "symbol"
"symbol".to_sym == :symbol
hurikhan77
or use strings!
Peter
Think you hit the problem dead on! Thanks!
Chris
+2  A: 

While using Symbol for keys instead of String is generally more efficient, the amount of efficiency gained is proportionate to the level of duplication involved. Since your keys are by definition unique, you should probably just use String keys to avoid jamming the Symbol table full of entries.
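As a rough illustration of the String-key approach, here is a minimal sketch. The sample lines and their tab-separated "word, count" layout are assumptions made up for the example, not the actual n-gram file format, and `lines` and `counts` are hypothetical names.

```ruby
# Hypothetical sample input in an assumed "word<TAB>count" layout.
lines = ["the\t100", "ruby\t7", "the\t5"]

# Default value of 0 so repeated words can be accumulated directly.
counts = Hash.new(0)

lines.each do |line|
  word, count = line.chomp.split("\t")
  # The key stays a String, so nothing is interned into the symbol table.
  counts[word] += count.to_i
end
```

Lookups then use plain strings as well, for example `counts["the"]`.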

tadman
I'm assuming there is some savings on the lookup but as far as how much, it's not clear. So strings may very well be sufficient.
Chris
On lookup there would only be a savings if the key you're trying to resolve has already been converted to a symbol, and even then it's easy to argue that hashing the string into a symbol and then hashing the symbol itself is less efficient than simply hashing the string. The entire symbol space functions as a hash, after all.
tadman
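
To make the interning point above concrete, here is a small sketch (the variable names are just for illustration). Converting a string to a symbol resolves it to a single shared object, while equal strings remain separate objects that still hash to the same value, which is all a Hash needs to find the key.

```ruby
a = :ngram
b = "ngram".to_sym

# Both names refer to the one interned Symbol object.
same_symbol = a.equal?(b)

s1 = "ngram"
s2 = s1.dup

# Two distinct String objects with identical contents.
same_string_object = s1.equal?(s2)

# Equal contents give equal hash values, so either works as a Hash key.
same_string_hash = (s1.hash == s2.hash)

lookup = { s1 => 42 }
found = lookup[s2]
```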