views:

2357

answers:

7

I have a situation where I need to process large (many GB's) amounts of data as such:

  1. build a large string by appending many smaller (C char*) strings
  2. trim the string
  3. convert the string into a C++ const std::string for processing (read only)
  4. repeat

The data in each iteration are independent.

My question is, I'd like to minimise (if possible eliminate) heap allocated memory usage, as it at the moment is my largest performance problem.

Is there a way to convert a C string (char*) into a stl C++ string (std::string) without requiring std::string to internally alloc/copy the data?

Alternatively, could I use stringstreams or something similar to re-use a large buffer?

Edit: Thanks for the answers, for clarity, I think a revised question would be:

How can I build (via multiple appends) a stl C++ string efficiently. And if performing this action in a loop, where each loop is totally independant, how can I re-use thisallocated space.

+1  A: 

Is each iteration independent enough that you can use the same std::string for each iteration? One would hope that your std::string implementation is smart enough to re-use memory if you assign a const char * to it when it was previously used for something else.

Assigning a char * into a std::string must always at least copy the data. Memory management is one of the main reasons to use std::string, so you won't be a able to override it.

David Norman
+2  A: 

To help with really big strings SGI has the class Rope in its STL.
Non standard but may be usefull.

http://www.sgi.com/tech/stl/Rope.html

Martin York
+7  A: 

You can't actually form a std::string without copying the data. A stringstream would probably reuse the memory from pass to pass (though I think the standard is silent on whether it actually has to), but it still wouldn't avoid the copying.

A common approach to this sort of problem is to write the code which processes the data in step 3 to use a begin/end iterator pair; then it can easily process either a std::string, a vector of chars, a pair of raw pointers, etc. Unlike passing it a container type like std::string, it would no longer know or care how the memory was allocated, since it would still belong to the caller. Carrying this idea to it's logical conclusion is boost::range, which adds all the overloaded constructors to still let the caller just pass a string/vector/list/any sort of container with .begin() and .end(), or separate iterators.

Having written your processing code to work on an arbitrary iterator range, you could then even write a custom iterator (not as hard as it sounds, basically just an object with some standard typedefs, and operator ++/*/=/==/!= overloaded to get a forward-only iterator) that takes care of advancing to the next fragment each time it hit the end of the one it's working on, skipping over whitespace (I assume that's what you meant by trim). That you never had to assemble the whole string contiguously at all. Whether or not this would be a win depends on how many fragments/how large of fragments you have. This is essentially what the SGI rope mentioned by Martin York is: a string where append forms a linked list of fragments instead of a contiguous buffer, which is thus suitable for much longer values.

puetzk
I think your solution is the best approach (chaning the processing code) unfortunatly in this situation its not an option.
Akusete
Is there a standard specified way to achive buffer reuse? I just dont want to rely on the implementation on a specific platform.
Akusete
A: 

In this case, might it be better to process the char* directly, instead of assigning it to a std::string.

Alan
Yes it would, although the inputs (C char*'s) and output (std::string) are not under my control.
Akusete
+5  A: 

Is it at all possible to use a C++ string in step 1? If you use string::reserve(size_t), you can allocate a large enough buffer to prevent multiple heap allocations while appending the smaller strings, and then you can just use that same C++ string throughout all of the remaining steps.

See this link for more information on the reserve function.

e.James
Your solution sounds like what im after, although, when you say "you can just use that same C++ string" do you mean using clear(), and then continuing to build the next string?
Akusete
This solution is the simplest for me to adopt, currently, although I'm unsure if the re-use is actually specified by the standard (although it does seem to work on my implementation)
Akusete
Using clear() should work. As far as I know, clear() does not free the memory used by the string, and so does not affect the space allocated by reserve().
e.James
+2  A: 

This is a lateral thinking answer, not directly addressing the question but "thinking" around it. Might be useful, might not...

Readonly processing of std::string doesn't really require a very complex subset of std::string's features. Is there a possibility that you could do search/replace on the code that performs all the processing on std::strings so it takes some other type instead? Start with a blank class:

class lightweight_string { };

Then replace all std::string references with lightweight_string. Perform a compilation to find out exactly what operations are needed on lightweight_string for it to act as a drop-in replacement. Then you can make your implementation work however you want.

Daniel Earwicker
A: 

I am not sure if it is a bug in std::Strings += function. When i append a very lengthy string to std::string it is actually corrupting another malloced piece of memory. This was evident when i set a access watch point in GDB and back trace clearly showed that it was += function's memcpy that was doing memory corruption. Any scientific reason why it is getting corrupted ?

Vijay Nag