tags:

views:

1379

answers:

4

Dear StackOverFlowers,

I will be parsing 60GB of text and doing a lot of insert and lookups in maps. I just started using boost::unordered_set and boost::unordered_map As my program starts filling in these containers they start growing bigger and bigger and i was wondering if this would be a good idea to pre allocate memory for these containers. something like mymap::get_allocator().allocate(N); ?

or should i just leave them to allocate and figure out grow factors by themselves? the codes look like this

boost::unordered_map <string,long> words_vs_frequency, wordpair_vs_frequency;  
boost::unordered_map <string,float> word_vs_probability, wordpair_vs_probability,
        wordpair_vs_MI;     
//... ... ...         

N = words_vs_frequency.size();
long   y =0; float MIWij =0.0f, maxMI=-999999.0f;
for (boost::unordered_map <string,long>::iterator i=wordpair_vs_frequency.begin(); 
      i!=wordpair_vs_frequency.end(); ++i){
if (i->second >= BIGRAM_OCCURANCE_THRESHOLD)
 {
 y++;
 Wij = i->first;
 WordPairToWords(Wij, Wi,Wj);
 MIWij =  log ( wordpair_vs_probability[Wij] /
    (word_vs_probability[Wi] * word_vs_probability[Wj]) 
   );

 // keeping only the pairs which MI value greater than 
 if (MIWij > MUTUAL_INFORMATION_THRESHOLD)
  wordpair_vs_MI[ Wij ] = MIWij;
 if(MIWij > maxMI )
  maxMI = MIWij; 
 }

   }

Thanks in advance

A: 

Honestly, I think you would be best off writing your own allocator. You could, for instance, make an allocator with a method called preallocate(int N) which would reserve N bytes, then use unordered_map::get_allocator() for all your fun. In addition, with your own allocator, you could tell it to grab huge chunks at a time.

rlbond
+3  A: 

I would try it both ways, which will let you generate hard data showing whether one method works better than the other. We can speculate all day about which method will be optimal, but as with most performance questions, the best thing to do is try it out and see what happens (and then fix the parts that actually need fixing).

That being said, the Boost authors seem to be very smart, so it quite possibly will work fine as-is. You'll just have to test and see.

Charlie
+6  A: 

According to the documentation, both unordered_set and unordered_map have a method

void rehash(size_type n);

that regenerates the hashtable so that it contains at least n buckets. (It sounds like it does what reserve() does for STL containers).

j_random_hacker
+1  A: 

so which is faster--letting it grow by itself or preallocating?

rogerdpack
Fixing the bucket size manually worked good till the container grew till about 75 megs. after that it was actually growing slower.So I didnt end up manually messing up with bucketsize
Sorry for loggin in after a year