views:

181

answers:

4

I'm working on some code that is normalizing a lot of data. At the end of processing, a number of key="value" pairs is written out to a file.

The "value" part could be anything, so at the point of output the values must have any embedded quotes escaped as \".

Right now, I'm using the following:

outstream << boost::regex_replace(src, rxquotesearch, quoterepl);
// (where rxquotesearch is  boost::regex("\"")  and quoterepl is "\\\\\"")

However, gprof shows I'm spending most of my execution time in this method, since I have to call it for every value for every line.

I'm curious if there is a faster way than this. I can't use std::replace since I'm replacing one character with two.

Thanks for any advice.

A: 

Well here is an implementation using string::find and string::insert, not sure if its faster, you'll have to figure that out! Here it is:

std::string src = "hey there i have \" all \" over the f\"in pla\"ce\"";
size_t n = 0;
while ( (n=src.find("\"",n)) != std::string::npos )
{
 src.insert(n,"\\");
 n+=2;
} 
std::cout << src << std::endl;

Which printed:

hey there i have \" all \" over the f\"in pla\"ce\"

DeusAduro
+1  A: 

I'm not surprised that the regex is really slow here - you're using a big, general-purpose hammer to pound in a tiny little nail. Of course, if you ended up needing to do something more interesting, the regex might quickly gain the advantage in terms of simplicity.

As for a simpler/faster approach, you could try writing the escaped string into a separate buffer one character at a time. Then it becomes trivial to add the escapes, and you don't waste any time reallocating the string or shifting characters. The biggest difficulty will be managing the size of your buffer, but you could just use a vector for that, and reuse the same vector for each string to avoid repeated allocations. The efficiency gain would depend a lot on the details of how vector works, but you can always boil it down to raw arrays and manual memory management if you need to.

The routine might look something like this, if you used vector:

vector<char> buf;
for( some_iterator it = all_the_strings.begin();
     it != all_the_strings.end(); ++it )
{
    buf.clear();
    const string & str = *it;
    for( size_t i = 0; i < str.size(); ++i )
    {
        if( str[i] == '"' || str[i] == '\\' )
            buf.push_back( '\\' );
        buf.push_back( str[i] );
    }
    buf.push_back( '\0' );

    // note: this is not guaranteed to be safe, see answer comments
    const char * escaped = &buf[0];

    // print escaped string to file here...
}
Charlie
With optimizations on, and pre-allocation, the vector should be just as fast the majority of the time (the times it isn't required to grow). Yours is likely faster than mine, given the shifting going on in mine.
DeusAduro
John Kugelman
I can believe it's probably not the best - can you elaborate as to why and/or suggest a better alternative?
Charlie
You're not supposed to access a vector's backing store directly like that. It'll probably work but it's not guaranteed to. An alternate is to use a `string` instead of `vector<char>` and call `c_str()` if you need a `char*` pointer.
John Kugelman
I figured that might be the issue. Good call then, your implementation is probably better (they're nearly the same conceptually).
Charlie
+3  A: 

If speed is a concern you should use a hand-written function to do this. Notice the use of reserve() to try to keep memory (re)allocation to a minimum.

string escape_quotes(const string &before)
{
    string after;
    after.reserve(before.length() + 4);

    for (string::size_type i = 0; i < before.length(); ++i) {
        switch (before[i]) {
            case '"':
            case '\\':
                after += '\\';
                // Fall through.

            default:
                after += before[i];
        }
    }

    return after;
}
John Kugelman
should you maybe reserve before.length+10% or so? assuming that there is at least one embedded quotation, we would need to grow?
DeusAduro
Yeah, fudge with that as needed.
John Kugelman
So, I'm curious about the comment I posted below. Any thoughts why this would be faster than outputting each character directly to the stream within the for loop?
Joe
A minor possible optimization to try would be to reuse the same "after" string object over and over (ie: pass the pointer to after as a parameter and make the function return void) that would reduce the number of constructor and memory allocation calls as it should only re-allocate memory if as it hits a string larger than it currently has space for.
KPexEA
+2  A: 
KPexEA
Interestingly, I took John Kugelman's snippet and used it to instead write directly to the stream, and it measurably slower than first building the temporary string then writing the whole string to the stream. Not what I was expecting at all!
Joe
@Joe: It makes sense to me, because it requires N calls to the stream output function, which probably has a good bit of logic inside it. If you build the output in a temporary buffer, you can (if necessary) get it down to a small constant number of method calls and a pretty tight inner loop, which seems much more efficient. If speed is what really matters here, it won't get much faster than that.
Charlie
Just goes to show you don't know until you actually code and it try.This "could" be faster on a different machine or compiler, it all depends on the overhead of printing single characters versus complete strings.
KPexEA