I need to insert 10-million strings into a C++ STL set. The strings are sorted. Will I have a pathological problem if I insert the strings in sorted order? Should I randomize first? Or will the G++ STL implementation automatically rebalance for me?
The set implementation typically uses a red-black tree, which will rebalance for you. However, insertion may be faster (or it may not) if you randomise the data before inserting - the only way to be sure is to do a test with your set implementation and specific data. Retrieval times will be the same, either way.
http://en.wikipedia.org/wiki/Standard_Template_Library
set: "Implemented using a self-balancing binary search tree."
The implementation will re-balance automatically. Given that you know the input is sorted, however, you can give it a bit of assistance: You can supply a "hint" when you do an insertion, and in this case supplying the iterator to the previously inserted item will be exactly the right hint to supply for the next insertion. In this case, each insertion will have amortized constant complexity instead of the logarithmic complexity you'd otherwise expect.
g++'s libstdc++ uses red black trees for sets and maps.
http://en.wikipedia.org/wiki/Red-black_tree
This is a self balancing tree, and insertions are always O(log n). The C++ standard also requires that all implementations have this characteristic, so in practice, they are almost always red black trees, or something very similar.
So don't worry about the order you put the elements in.
The only question I have: do you really need a set
?
If the data is already sorted and you don't need to insert / delete elements after the creation, a deque
would be better:
- you'll have the same big-O complexity using a binary search for retrieval
- you'll get less memory overhead... and better cache locality
On binary_search
: I suspect you need more than a ForwardIterator
for a binary search, guess this site is off again :(
A very cheap and simple solution is to insert from both ends of your collections of strings. That is to say, first add "A", then "ZZZZZ", then "AA", then "ZZZZY", etcetera until you meet in the middle. It doesn't require the hefty cost of shuffling, yet it is likely to sidestep pathological cases.