ansaurus

Question

How to achieve better efficiency re-inserting into sets in C++

Answer 1

+2 A:

std::set::insert returns a pair<iterator, bool> as far as I know. In any case, directly modifying an element in any sort of set is risky. What if your modification causes the item to compare equal to another existing item? What if it changes the item's position in the total order of items in the set? Depending on the implementation, this will cause undefined behaviour.

If the item's key remains the same and only its properties change, then I think what you really want is a map or an unordered_map instead of a set.

Peter Ruderman 2010-06-21 20:32:06

The modification does not change the items key, so I don't need to worry about undefined behavior, but that is a good point. I'll check out maps, thanks for the tip.

lashleigh 2010-06-21 20:36:41

If you read carefully, you'll notice that she's removing and then reinserting the colliding item to avoid the problems you described.

Benson 2010-06-21 20:37:07

@Benson: Yes, I did notice that, but the question is can it be done directly? The answer is yes, use a map.

Peter Ruderman 2010-06-21 20:39:04

How about the other question: can it be done with minimal modification to the existing code? (i.e. can it be done with a set while maintaining an iterator for cheap inserts?)

Benson 2010-06-21 21:00:33

I think switching to a map is a fairly minor change. The only annoying point is that map iterators point to key/value pairs instead of just values. So calls like iterator->Foo() become iterator->second.Foo().

Peter Ruderman 2010-06-21 21:32:31

Answer 2

+2 A:

I agree with Peter that a map is probably a better model of what you are doing, specifically something like map<pep_key, Peptide::Peptide>, would let you do something like:

insertResult = myMap.insert(std::make_pair(newPep.keyField(), newPep));
if( insertResult.second == false )
    insertResult.first->second = newPep;

To answer your question, the insert segfaults because erase invalidates an iterator, so inserting with it (or a copy of it) is analogous to dereferencing an invalid pointer. The only way I see to do what you want is with a const_cast

insertResult = mySet.insert(newPep);
if( insertResult.second == false )
    const_cast<Peptide::Peptide&>(*(insertResult.first)) = newPep;

the const_cast approach looks like it will work for what you are doing, but is generally a bad idea.

academicRobot 2010-06-21 21:00:28

Ah, of course, that should have been obvious. What if I were to initialize another iterator with --someIter, pointing to the preceding element? I do plan to look at maps, since most people agree that would be better practice, but I'm curious. Also, thank you for the code examples.

lashleigh 2010-06-21 21:16:37

@lash - yes, if you pass insert an iterator to the position preceding the insert position, insert will be very efficient (good doc [here](http://www.cplusplus.com/reference/stl/set/insert/)). Set iterators are not invalidated by changes to the set (unlike vector iterators, for example), so the approach you describe would work well. This approach is completely safe (checking `it != container.begin()` before `--it`), unlike using const_cast.

academicRobot 2010-06-21 21:29:06

I tried this out and it didn't work. Because by saying --it, the iterator was then pointing at the wrong element in the set and erasing the wrong element. So, then I thought well I can just do it++ to erase the correct element again, but that also failed, and it was just too messy.

lashleigh 2010-06-21 21:47:18

@lash To be clear, the procedure would be `tempIt = it; --tempIt; mySet.erase(it); mySet.insert(tempIt, ...);`.

academicRobot 2010-06-21 21:50:36

@academicRobot That did the trick, I had an unrelated bug that made it seem like it wasn't working at first. Thanks!

lashleigh 2010-06-21 22:03:14

Answer 3

+2 A:

I hope it isn't bad form to answer my own question, but I would like it to be here in case someone else ever has this problem. The answer of why my attempt seg faulted was given my academicRobot, but here is the solution to make this work with a set. While I do appreciate the other answers and plan to learn about maps, this question was about efficiently re-inserting into a set.

void modifySet(set<Peptide>::iterator someIter, Peptide::Peptide newPep) {
    if( someIter == someSet.begin() ) {
        Peptide tempPep = (*someIter);
        someSet.erase(someIter);
        // Modify tempPep - this does not modify the key
        someSet.insert(tempPep);   
    }
    else {
        Peptide tempPep = (*someIter);
        anotherIter = someIter;
        --anotherIter;
        someSet.erase(someIter);
        // Modify tempPep - this does not modify the key
        someSet.insert(anotherIter, tempPep); 
     }
}

In my program this change dropped my run time by about 15%, from 32 seconds down to 27 seconds. My larger data set is currently running and I have my fingers crossed that the 15% improvement scales.

lashleigh 2010-06-21 22:18:22

+1. Not bad form at all.

academicRobot 2010-06-21 22:40:53

Indeed - not bad form at all. And for precisely the reason you mentioned - if you've got a solution, share it! It will help others if they ever have the same issue.

Mac 2010-06-21 23:41:38

I suppose you would insert `newPep` instead ? Otherwise it is not used.

Matthieu M. 2010-06-22 06:38:41

What's the chance of taking the first branch? Is this dependent on the size of your input? If it's random, the savings will approach 0% as the set grows, but the cost of the branch will remain fixed. You can effectively merge the two by using `someIter++` as the insertion location; `set.end()` is a valid hint.

MSalters 2010-06-22 07:40:38

@MSalters The standard only guarantees O(1) for insertions after the hint (23.1.2), but gcc>4.2 implements before and after, according to [this proposal](http://gcc.gnu.org/onlinedocs/libstdc++/manual/bk01pt07ch17.html), so your idea works for gcc. Do you know if this is true for other common compilers?

academicRobot 2010-06-22 15:49:25

As far as I know it's common. It's fairly easy to use a hint when it's not perfect. Also, there are N+1 possible insertion points, and N+1 iterators into a container with N elements. Hence it actually makes more sense to interpret the hint that way. This is consistent with non-ordered containers, where `insert(newElement, iter)` inserts `newElement` before `*iter`

MSalters 2010-06-23 08:59:58

Answer 4

+1 A:

As you realized set are a bit messy to deal with because you have no way to indicate which part of the object should be considered for the key and which part you can modify safely.

The usual answer is to use a map or an unordered_map (if you have access to C++0x) and cut your object in two halves: the key and the satellite data.

Beware of the typical answer: std::map<key_type, Peptide>, while it seems easy it means you need to guarantee that the key part of the Peptide object always match the key it's associated with, the compiler won't help.

So you have 2 alternatives:

Cut Peptide in two: Peptide::Key and Peptide::Data, then you can use the map safely.
Don't provide any method to alter the part of Peptide which defines the key, then you can use the typical answer.

Finally, note that there are two ways to insert in a map-like object.

insert: insert but fails if the value already exists
operator[]: insert or update (which requires creating an empty object)

So, a solution would be:

class Peptide
{
public:
  Peptide(int const id): mId(id) {}

  int GetId() const;

  void setWeight(float w);
  void setLength(float l);

private:
  int const mId;
  float mWeight;
  float mLength;
};

typedef std::unordered_map<int, Peptide> peptide_map;

Note that in case of update, it means creating a new object (default constructor) and then assigning to it. This is not possible here, because assignment means potentially changing the key part of the object.

Matthieu M. 2010-06-22 06:48:30

Thank you for the map explanation, but I don't understand your first sentence; don't I indicate which part of the object is the key by the way I define the comparison function?

lashleigh 2010-06-22 17:31:22

Yes and no, the comparison function can be arbitrarily complicated and therefore it would be near impossible for a compiler writer to check which part of the object is part of the key and which is not... and of course what's worse is that: `set` is part of the STL which is a library and not the core language itself, so should not elicit particular support from the compiler; and C++ as no syntax for "partial" const-ness. The usual workaround is to make the members not part of the key `mutable` and accessible via `const` methods, but that's more a hack that anything...

Matthieu M. 2010-06-23 06:12:17

Answer 5

A:

std::map will make your life a lot easier and I wouldn't be surprised if it outperforms std::set for this particular case. The storage of the key might seem redundant but can be trivially cheap (ex: pointer to immutable data in Peptide with your own comparison predicate to compare the pointee correctly). With that you don't have to fuss about with the constness of the value associated with a key.

If you can change Peptide's implementation, you can avoid redundancy completely by making Peptide into two separate classes: one for the key part and one for the value associated with the key.

2010-06-24 12:30:43

ansaurus

tags:

views:

answers:

How to achieve better efficiency re-inserting into sets in C++

related questions