I need to insert 10 million strings into a C++ STL set. The strings are sorted. Will I have a pathological problem if I insert the strings in sorted order? Should I randomize first? Or will the g++ STL implementation automatically rebalance for me?

+4  A: 

The set implementation typically uses a red-black tree, which will rebalance for you. However, insertion may be faster (or it may not) if you randomise the data before inserting - the only way to be sure is to do a test with your set implementation and specific data. Retrieval times will be the same, either way.
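
For instance, a minimal timing sketch of such a test (assuming a C++11 compiler for `<chrono>`, `<random>`, and `std::to_string`; the data generation is purely illustrative):

```cpp
#include <algorithm>
#include <chrono>
#include <iostream>
#include <random>
#include <set>
#include <string>
#include <vector>

// Time how long inserting the given strings into a set takes.
static double insert_seconds(const std::vector<std::string>& v) {
    auto start = std::chrono::steady_clock::now();
    std::set<std::string> s;
    for (const std::string& str : v)
        s.insert(str);                 // plain insert, no hint
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

int main() {
    // Illustrative data: numbered strings whose lexicographic order
    // matches their numeric order, i.e. already sorted.
    std::vector<std::string> data;
    for (int i = 0; i < 1000000; ++i)
        data.push_back("key" + std::to_string(1000000 + i));

    double sorted_time = insert_seconds(data);
    std::shuffle(data.begin(), data.end(), std::mt19937(42));
    double shuffled_time = insert_seconds(data);

    std::cout << "sorted: " << sorted_time
              << "s, shuffled: " << shuffled_time << "s\n";
}
```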

anon
One of my gripes with the STL set is that you can't allocate memory for it ahead of time, which would be my concern if you're putting 10 million strings into one.
Jacob Schlather
A red-black tree guarantees O(log N) complexity for insertion operations. Why randomize?
Kirill V. Lyadvinsky
@Kirill It guarantees that as an upper bound - randomised order may give better performance, as less rebalancing will probably need to be done.
anon
@Lib I've certainly inserted a million or so strings into a set with no problems.
anon
@Liberalkid: can't you specify a custom allocator as the 3rd template parameter to `std::set<>`, e.g. using `boost::pool`? This would allow you to use something like a pool-based allocator.
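
A sketch of that idea, assuming Boost.Pool's `boost::fast_pool_allocator` (from `<boost/pool/pool_alloc.hpp>`) is available; `pooled_set` is just an illustrative name:

```cpp
#include <set>
#include <string>
#include <boost/pool/pool_alloc.hpp>

// A set whose tree nodes are carved out of a Boost memory pool rather
// than allocated one at a time from the default allocator. Note the
// strings themselves still allocate their character buffers normally.
typedef std::set<std::string,
                 std::less<std::string>,
                 boost::fast_pool_allocator<std::string> > pooled_set;

int main() {
    pooled_set s;
    s.insert("hello");
    s.insert("world");
}
```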
the_mandrill
I just had a sum-of-subsets problem in a programming contest where `std::set` was too slow but Java's hash table wasn't. I've been bitter ever since. @Mandrill: that does sound like a good idea, though.
Jacob Schlather
When will people accept that Java isn't slow anymore!
Peter Recore
@Peter Who mentioned Java, or its slowness?
anon
@Neil Butterworth: Liberalkid mentioned java in the comment directly above mine.
Peter Recore
@Peter Sorry, I missed that - it went under the ads on the RHS of the page. Anyway, he was saying Java was faster, I think!
anon
+1  A: 

http://en.wikipedia.org/wiki/Standard_Template_Library

set: "Implemented using a self-balancing binary search tree."

Tom Sirgedas
A: 

Maybe `unordered_set` could be an alternative.
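
A sketch of that, assuming a compiler with C++11's `std::unordered_set` (on older toolchains this lives in `std::tr1::unordered_set`). A hash table gives average O(1) insertion regardless of input order, and `reserve` addresses the pre-allocation gripe above, though you lose sorted iteration:

```cpp
#include <string>
#include <unordered_set>

int main() {
    std::unordered_set<std::string> s;
    s.reserve(10000000);       // pre-size the table for 10 million strings
    s.insert("example");       // average O(1); input order is irrelevant
}
```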

Zitrax
+2  A: 

The implementation will re-balance automatically. Given that you know the input is sorted, however, you can give it a bit of assistance: You can supply a "hint" when you do an insertion, and in this case supplying the iterator to the previously inserted item will be exactly the right hint to supply for the next insertion. In this case, each insertion will have amortized constant complexity instead of the logarithmic complexity you'd otherwise expect.
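
A minimal sketch of the hinted form (container and function names here are illustrative):

```cpp
#include <set>
#include <string>
#include <vector>

// Build a set from already-sorted strings using the hinted overload of
// set::insert. The iterator returned by each insertion points at the
// just-inserted element, which is exactly the right hint for the next,
// larger value, so each insertion is amortized constant time.
std::set<std::string> build_set(const std::vector<std::string>& sorted) {
    std::set<std::string> s;
    std::set<std::string>::iterator hint = s.begin();
    for (std::vector<std::string>::const_iterator it = sorted.begin();
         it != sorted.end(); ++it)
        hint = s.insert(hint, *it);
    return s;
}
```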

Jerry Coffin
Good advice, yet the tree will be rebalanced quite often.
Matthieu M.
@Matthieu: True. I'm pretty sure at least in terms of complexity it's better than shuffling the data first though. With shuffled data, the overall complexity is O(N lg N) since you have to search for the insertion point for every new element. With the data sorted, each insertion has amortized constant complexity, so the overall complexity is amortized O(N). It may still be open to question whether it's better in practice though. If you can hold all the data in memory, you could try building the tree perfectly from the beginning (recursively bisect the data).
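
A sketch of that bisection idea (illustrative names; assumes the sorted data sits in a random-access container):

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Insert the middle element first, then recurse on each half, so the
// set receives its elements roughly in level order and the tree needs
// very little rebalancing. Recursion depth is only O(log N).
void insert_bisect(std::set<std::string>& s,
                   const std::vector<std::string>& v,
                   std::size_t lo, std::size_t hi) {  // half-open [lo, hi)
    if (lo >= hi) return;
    std::size_t mid = lo + (hi - lo) / 2;
    s.insert(v[mid]);
    insert_bisect(s, v, lo, mid);
    insert_bisect(s, v, mid + 1, hi);
}
```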
Jerry Coffin
I like this bisection idea, though I suppose that in practice, given the number of records, it would be slower than rebalancing because of cache issues as we keep hitting new pages of memory.
Matthieu M.
+1  A: 

g++'s libstdc++ uses red-black trees for sets and maps.

http://en.wikipedia.org/wiki/Red-black_tree

This is a self-balancing tree, and insertions are always O(log n). The C++ standard also requires this complexity of all implementations, so in practice they are almost always red-black trees, or something very similar.

So don't worry about the order you put the elements in.

catphive
+2  A: 

The only question I have: do you really need a set?

If the data is already sorted and you don't need to insert / delete elements after the creation, a deque would be better:

  • you'll have the same big-O complexity using a binary search for retrieval
  • you'll get less memory overhead... and better cache locality

On `binary_search`: I suspect you need more than a ForwardIterator for a binary search; I guess this site is off again :(
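
For retrieval, a sketch (assuming the strings were loaded in sorted order; the function name is illustrative):

```cpp
#include <algorithm>
#include <deque>
#include <string>

// Membership test on an already-sorted deque: O(log N) comparisons,
// and no per-node pointer overhead as in a tree.
bool contains(const std::deque<std::string>& sorted, const std::string& key) {
    return std::binary_search(sorted.begin(), sorted.end(), key);
}
```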

Matthieu M.
No, the documentation is correct. binary_search uses "advance" which is constant time for random-access iterator, and linear for ForwardIterator. So ForwardIterator is the minimum requirement for the algorithm. See footnote on http://www.sgi.com/tech/stl/binary_search.html.
BennyG
I'd rather use a set because, frankly, that's the functionality I need.
vy32
@BennyG: thanks for this. As noted on both sites for non RandomAccessIterators the number of steps is linear, not logarithmic. I had somehow assumed that we could only have a logarithmic complexity.
Matthieu M.
@vy32: Since you worried about performance I had thought you were ready to get your hands dirty. If you just want to keep using a `set`, measure the time and if it's good enough for you don't worry about it.
Matthieu M.
Turns out that you were right! I was able to create the list in advance and just use a binary search. I'm storing the list pre-compiled in the program as a huge array with 200,000+ elements. (I ended up writing my own binary search, rather than using the C++ vector.)
vy32
+1  A: 

A very cheap and simple solution is to insert from both ends of your collection of strings. That is to say, first add "A", then "ZZZZZ", then "AA", then "ZZZZY", et cetera until you meet in the middle. It doesn't require the hefty cost of shuffling, yet it is likely to sidestep pathological cases.
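
A sketch of that scheme (assuming the sorted strings are already in a random-access container; names are illustrative):

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Alternate between the smallest and largest remaining strings, so
// consecutive insertions land in distant parts of the tree.
std::set<std::string> insert_from_both_ends(const std::vector<std::string>& sorted) {
    std::set<std::string> s;
    std::size_t lo = 0, hi = sorted.size();
    while (lo < hi) {
        s.insert(sorted[lo++]);        // next smallest
        if (lo < hi)
            s.insert(sorted[--hi]);    // next largest
    }
    return s;
}
```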

MSalters