views:

643

answers:

4

Hey! I want to implement a copy-on-write on my custom C++ String class, and I wonder how to...

I tried to implement some options, but they all turned out very inefficient.

Thank you guys :-)

+6  A: 

In a multi-threaded environemnt (which is most of them nowadays) CoW is frequently a huge performance hit rather than a gain. And with careful use of const references, it's not much of a performance gain even in a single threaded environment.

This old DDJ article explains just how bad CoW can be in a multithreaded environment, even if there's only one thread.

Additionally, as other people have pointed out, CoW strings are really tricky to implement, and it's easy to make mistakes. That coupled with their poor performance in threading situations makes me really question their usefulness in general. This becomes even more true once you start using C++[01]x move construction and move assignment.

But, to answer your question....

Here are a couple of implementation techniques that may help with performance.

First, store the length in the string itself. The length is accessed quite frequently and eliminating the pointer dereference would probably help. I would, just for consistency put the allocated length there too. This will cost you in terms of your string objects being a bit bigger, but the overhead there in space and copying time is very small, especially since these values will then become easier for the compiler to play interesting optimization tricks with.

This leaves you with a string class that looks like this:

class MyString {
   ...
 private:
   class Buf {
      ...
    private:
      ::std::size_t refct_;
      char *data_;
   };

   ::std::size_t len_;
   ::std::size_t alloclen_;
   Buf *data_;
};

Now, there are further optimizations you can perform. The Buf class there looks like it doesn't really contain or do much, and this is true. Additionally, it requires allocating both an instance of Buf and a buffer to hold the characters. This seems rather wasteful. So, we'll turn to a common C implementation technique, stretchy buffers:

class MyString {
   ...
 private:
   struct Buf {
      ::std::size_t refct_;
      char data_[1];
   };

   void resizeBufTo(::std::size_t newsize);
   void dereferenceBuf();

   ::std::size_t len_;
   ::std::size_t alloclen_;
   Buf *data_;
};

void MyString::resizeBufTo(::std::size_t newsize)
{
   assert((data_ == 0) || (data_->refct_ == 1));
   if (newsize != 0) {
      // Yes, I'm using C's allocation functions on purpose.
      // C++'s new is a poor match for stretchy buffers.
      Buf *newbuf = ::std::realloc(data_, sizeof(*newbuf) + (newsize - 1));
      if (newbuf == 0) {
         throw ::std::bad_alloc();
      } else {
         data_ = newbuf_;
      }
   } else { // newsize is 0
      if (data_ != 0) {
         ::std::free(data_);
         data_ = 0;
      }
   }
   alloclen_ = newsize;
}

When you do things this way, you can then treat data_->data_ as if it contained alloclen_ bytes instead of just 1.

Keep in mind that in all of these cases you will have to make sure that you either never ever use this in a multi-threaded environment, or that you make sure that refct_ is a type that you have both an atomic increment, and an atomic decrement and test instruction for.

There is an even more advanced optimization technique that involves using a union to store short strings right inside the bits of data that you would use to describe a longer string. But that's even more complex, and I don't think I will feel inclined to edit this to put a simplified example here later, but you never can tell.

Omnifarious
Got a link for that analysis? I've heard similar claims, and I've always wondered about the details.
Adrian McCarthy
I argue that CoW is a huge benefit to programmer productivity on any platform. You no longer need to deal with explicit sharing, take care of references, make sure you copy when needed. Practically all modern platforms support fast atomic operations and CoW is a helluva lot cheaper than doing a deep copy every time (as Herb wants). Your poor performance argument is moot IMO.
iconiK
@iconiK, show me numbers and we'll talk. My argument is based on actual numbers resulting from empirical tests, not handwaving about the speed of atomic operations and bald assertions that deep copying is more expensive. Atomic operations require memory barriers to implement, and those can be quite expensive. I want to see numbers showing that deep copying is more expensive than atomic reference counts before I alter my stance.
Omnifarious
A: 

There's not much to CoW. Basically, you copy when you want to change it, and let anyone who doesn't want to change it keep the reference to the old instance. You'll need reference counting to keep track of who's still referencing the object, and since you're creating a new copy, you need to decrease the count on the 'old' instance. A shortcut would be to not make a copy when that count is one (you're the only reference).

Other than that, there's not much that can be said, unless there's a specific problem you're facing.

roe
sbk
@sbk: easy answer? don't use it. :) for example, you could have get/set methods for single character manipulations.
roe
@roe: But that would be one crippled string class... I remember how disgusted I was when I saw Java's charAt method on strings. Yuck
sbk
@sbk: may be, but giving someone a reference/pointer to the internals is going to hurt you either way. Even if you copy now, someone else might get a read-reference to your object at a later stage. No-one an be allowed to interact with anything but the string object. You could work around it though by implementing a character-pointer-object (returned by the [] operator) which will do the copy in the assignment operator. That's actually a pretty interesting idea, thanks!
roe
@roe: I'm glad something was born in this discussion :) Anyway, in my opinion CoW is just not worth the extra complexity. I'd rather go for the immutable string + string builder combination as in .Net/Java/Python and pretty much any other "modern" language.
sbk
thanks for the downvote y'all, now I've got a nice and even rep.. ;) Seriouosly though, care to elaborate?
roe
+3  A: 

There's a series of articles about exactly this on Herb Sutter's More Exceptional C++ book. If you don't have access to it, you can try following through the Internet articles: part 1, part 2 and part 3.

ltcmelo
+1  A: 

You may want to emulate the 'immutable' string that other languages have (Python, C# as far as I know).

The idea is that each string is immutable, thus any work on a string create a new immutable one... or that's the basic idea, to avoid explosion, you would need not to create another if there is a similar one.

Matthieu M.