views:

2578

answers:

8

Hello,

Windows XP SP3. Core 2 Duo 2.0 GHz. I'm finding the boost::lexical_cast performance to be extremely slow. Wanted to find out ways to speed up the code. Using /O2 optimizations on visual c++ 2008 and comparing with java 1.6 and python 2.6.2 I see the following results.

Integer casting:

c++: 
std::string s ;
for(int i = 0; i < 10000000; ++i)
{
    s = boost::lexical_cast<string>(i);
}

java:
String s = new String();
for(int i = 0; i < 10000000; ++i)
{
    s = new Integer(i).toString();
}

python:
for i in xrange(1,10000000):
    s = str(i)

The times I'm seeing are

c++: 6700 milliseconds

java: 1178 milliseconds

python: 6702 milliseconds

c++ is as slow as python and 6 times slower than java.

Double casting:

c++:
std::string s ;
for(int i = 0; i < 10000000; ++i)
{
    s = boost::lexical_cast<string>(d);
}

java:
String s = new String();
for(int i = 0; i < 10000000; ++i)
{
    double d = i*1.0;
    s = new Double(d).toString();
}

python:
for i in xrange(1,10000000):
    d = i*1.0
    s = str(d)

The times I'm seeing are

c++: 56129 milliseconds

java: 2852 milliseconds

python: 30780 milliseconds

So for doubles c++ is actually half the speed of python and 20 times slower than the java solution!!. Any ideas on improving the boost::lexical_cast performance? Does this stem from the poor stringstream implementation or can we expect a general 10x decrease in performance from using the boost libraries.

+11  A: 

lexical_cast is more general than the specific code you're using in Java and Python. It's not surprising that a general approach that works in many scenarios (lexical cast is little more than streaming out then back in to and from a temporary stream) ends up being slower than specific routines.

(BTW, you may get better performance out of Java using the static version, Integer.toString(int). [1])

Finally, string parsing and deparsing is usually not that performance-sensitive, unless one is writing a compiler, in which case lexical_cast is probably too general-purpose, and integers etc. will be calculated as each digit is scanned.

[1] Commenter "stepancheg" doubted my hint that the static version may give better performance. Here's the source I used:

public class Test
{
    static int instanceCall(int i)
    {
        String s = new Integer(i).toString();
        return s == null ? 0 : 1;
    }

    static int staticCall(int i)
    {
        String s = Integer.toString(i);
        return s == null ? 0 : 1;
    }

    public static void main(String[] args)
    {
        // count used to avoid dead code elimination
        int count = 0;

        // *** instance

        // Warmup calls
        for (int i = 0; i < 100; ++i)
            count += instanceCall(i);

        long start = System.currentTimeMillis();
        for (int i = 0; i < 10000000; ++i)
            count += instanceCall(i);
        long finish = System.currentTimeMillis();
        System.out.printf("10MM Time taken: %d ms\n", finish - start);


        // *** static

        // Warmup calls
        for (int i = 0; i < 100; ++i)
            count += staticCall(i);

        start = System.currentTimeMillis();
        for (int i = 0; i < 10000000; ++i)
            count += staticCall(i);
        finish = System.currentTimeMillis();
        System.out.printf("10MM Time taken: %d ms\n", finish - start);
        if (count == 42)
            System.out.println("bad result"); // prevent elimination of count
    }
}

The runtimes, using JDK 1.6.0-14, server VM:

10MM Time taken: 688 ms
10MM Time taken: 547 ms

And in client VM:

10MM Time taken: 687 ms
10MM Time taken: 610 ms

Even though theoretically, escape analysis may permit allocation on the stack, and inlining may introduce all code (including copying) into the local method, permitting elimination of redundant copying, such analysis may take quite a lot of time and result in quite a bit of code space, which has other costs in code cache that don't justify themselves in real code, as opposed to microbenchmarks like seen here.

Barry Kelly
You won't get better performance using static Integer.toString(int) because JIT works well.
stepancheg
stepancheg - not necessarily true. See the little benchmark test I added.
Barry Kelly
Barry, run "instance" test after "static" test and you have "instance" toString work faster.
stepancheg
I'll have to let others do the modifications to the code as I don't want the sample to get much longer, but that's not what I saw even after I reversed the pair, put it inside a loop, and unrolled the loop three times in its inner body. With both client VM I see the static call being a steady 10+% faster, with the margin being less but still present server VM.
Barry Kelly
@Barry. Thanks for the suggestion on the static case, it is about 20% faster in my case.
Nathan
That's cumulative times over 9 iterations, i.e. 3x loop with 3x alternating pairs, with static running first. Sample timing: Totals: inst: 6174 ms, stat: 5483 ms
Barry Kelly
I don't quite get why you check for count being equal to 42. Can you please elaborate on it.
batbrat
@batbrat - the comparison stops the compiler from potentially eliminating the code in staticCall, and avoiding calling it, etc. Probably javac and the JVM won't eliminate it, but if I never did anything with the return value, and the code in the methods doesn't do anything with global state, then it is legal for the compiler to eliminate the code, since it would have no observable side-effect. By checking count, I force the compiler to at least evaluate the code at compile time (I know it won't do that though, too complex), or else leave the code in to be executed at runtime (what I want).
Barry Kelly
+1  A: 

As Barry said, lexical_cast is very general, you should use a more specific alternative, for example check out itoa (int->string) and atoi (string -> int).

Motti
Your answer does not explain, why lexical_cast slow.
stepancheg
+8  A: 

You could specialize lexical_cast for int and double types. Use strtod and strtol in your's specializations.

namespace boost {
template<>
inline int lexical_cast(const std::string& arg)
{
    char* stop;
    int res = strtol( arg.c_str(), &stop, 10 );
    if ( *stop != 0 ) throw_exception(bad_lexical_cast(typeid(int), typeid(std::string)));
    return res;
}
template<>
inline std::string lexical_cast(const int& arg)
{
    char buffer[65]; // large enough for arg < 2^200
    ltoa( arg, buffer, 10 );
    return std::string( buffer ); // RVO will take place here
}
}//namespace boost

int main(int argc, char* argv[])
{
    std::string str = "22"; // SOME STRING
    int int_str = boost::lexical_cast<int>( str );
    std::string str2 = boost::lexical_cast<std::string>( str_int );

    return 0;
}

This variant will be faster than using default implementation, because in default implementation there is construction of heavy stream objects. And it is should be little faster than printf, because printf should parse format string.

Kirill V. Lyadvinsky
This goes the wrong way. He wants int to string.
GMan
Added int to string specialization.
Kirill V. Lyadvinsky
Great idea! +1.
j_random_hacker
+29  A: 

Just to add info on Barry's and Motti's excellent answers:

Some background

Please remember Boost is written by the best C++ developers on this planet, and reviewed by the same best developers. If lexical_cast was so wrong, someone would have hacked the library either with criticism or with code.

I guess you missed the point of lexical_cast's real value...

Comparing apples and oranges.

In Java, you are casting an integer into a Java String. You'll note I'm not talking about an array of characters, or a user defined string. You'll note, too, I'm not talking about your user-defined integer. I'm talking about strict Java Integer and strict Java String.

In Python, you are more or less doing the same.

As said by other posts, you are, in essence, using the Java and Python equivalents of sprintf (or the less standard itoa).

In C++, you are using a very powerful cast. Not powerful in the sense of raw speed performance (if you want speed, perhaps sprintf would be better suited), but powerful in the sense of extensibility.

Comparing apples.

If you want to compare a Java Integer.toString method, then you should compare it with either C sprintf or C++ ostream facilities.

The C++ stream solution would be 6 times faster (on my g++) than lexical_boost, and quite less extensible:

inline void toString(const int value, std::string & output)
{
   // The largest 32-bit integer is 4294967295, that is 10 chars
   // On the safe side, add 1 for sign, and 1 for trailing zero
   char buffer[12] ;
   sprintf(buffer, "%i", value) ;
   output = buffer ;
}

The C sprintf solution would be 8 times faster (on my g++) than lexical_boost but a lot less safe:

inline void toString(const int value, char * output)
{
   sprintf(output, "%i", value) ;
}

Both solutions are either as fast or faster than your Java solution (according to your data).

Comparing oranges.

If you want to compare a C++ lexical_cast, then you should compare it with this Java pseudo code:

Source s ;
Target t = Target.fromString(Source(s).toString()) ;

Source and Target being of of whatever type you want, including built-in types like boolean or int, which is possible in C++ because we are using templates, here.

Extensibility? Is that a dirty word?

No, but it has a well known cost: When written by the same coder, general solutions to specific problems are usually slower than specific solutions written for their specific problems.

In the current case, in a naive viewpoint, lexical_cast will use the stream facilities to convert from a type A into a string stream, and then from this string stream into a type B.

This means that as long as your object can be output into a stream, and input from a stream, you'll be able to use lexical cast on it, without touching any single line of code.

So, what's the lexical_cast uses?

The lexical cast's main uses are:

  1. Ease of use (hey, a C++ cast that works for everything being a value!)
  2. Combining it with template heavy code, where your types are parametrized, and as such you don't want to deal with specifics, and you don't want to know the types.
  3. Still potentially relatively efficient, if you have basic template knowledge, as I will demonstrate below

The point 2 is very very important here, because it means we have one and only one interface/function to cast a value of a type into an equal or similar value of another type.

This is the real point you missed, and this is the point that costs in performance terms.

But it's so slooooooowwww!

If you want raw speed performance, remember you're dealing with C++, and that you have a lot of facilities to handle conversion efficiently, and still, keep the lexical_cast ease of use feature.

It took me some minutes to look at the lexical_cast source, and come with a viable solution. Add to your C++ code the following code:

#ifdef SPECIALIZE_BOOST_LEXICAL_CAST_FOR_STRING_AND_INT

namespace boost
{
   template<>
   std::string lexical_cast<std::string, int>(const int &arg)
   {
      // The largest 32-bit integer is 4294967295, that is 10 chars
      // On the safe side, add 1 for sign, and 1 for trailing zero
      char buffer[12] ;
      sprintf(buffer, "%i", arg) ;
      return buffer ;
   }
}

#endif

By enabling this specialization of lexical_cast for strings and ints (by defining the macro SPECIALIZE_BOOST_LEXICAL_CAST_FOR_STRING_AND_INT), my code went 5 time faster on my g++ compiler, which means, according to your data, its performance should be similar to Java's.

And it took me 10 minutes of looking at boost code, and write a remotely efficient and correct 32-bit version. And with some work, it could probably go faster and safer (if we had direct write access to the std::string internal buffer, we could avoid a temporary external buffer, for example).

paercebal
What actually happens inside lexical_cast so it is so slow?
stepancheg
Why not read the source, or Neil's explanation, and find out for yourself?
Meredith L. Patterson
@stepancheg : The "slowness" of the lexical_cast can be attributed to the stream object constructed to handle the transformation from Source into stream, and then stream into Target. This is *the* very general solution, suitable for all cases, but not always performance-oriented. The specialization above enable the user to enhance performance when needed (remember: "Premature Optimization is the root of all Evil").
paercebal
@ stepancheg, lexical_cast is implemented through iostream, as a component of the standard library, whose design was not meant only to be a fast. I wouldn't say it's slow, rather, if speed is your concern, you surely have other alternatives in c++, boost.spirit 2.1 to name just one.
t.g.
@paercebal: Thanks for the suggestions on the C solutions and the detailed explanation. While no performance guarentees are mandatory, it doesn't give me confidence in the efficiency of this library for casting purposes in a generic fashion (who knows it might be 100x slower for something else). Can you think of examples where lexical_cast would be within the ballpark (~2x) of the specialized case. If I told you that std::vector could be anywhere from 5x to 20x slower than the specific case, I bet you would have very low confidence in using the library.
Nathan
@Nathan : The problem is that we must balance safety, extensibility and speed. This means that we are not supposed to use an object outside its performance aim, and expect to have it perform best than another, more adapted object.
paercebal
@Nathan: If you need performance for your specific case, then you can achieve almost C-like raw performance (as I did) by specializing the function, and hiding dirty details inside an inline function (as I did). Now, you should not become obsessive for something like a simple lexical cast, unless it was marked as speed bottleneck of your application. In this case, you must understand what's going wrong (copies, allocations, etc.?), and only then you must write a feature best adapted to your needs.
paercebal
@Nathan: As for the vector being 20x slower than other processing, and all this kind of object comparison, it can be difficult to make comparisons in the same context (as you tried when comparing a "toString" with a "lexical cast"). Vectors can be quite efficient, but STL offers you a lot of containers, so you should choose the one adapted to your needs and algorithms. A good exercise would be to try a manipulation of vectors of integers in Java and in C++,, and explain the difference...
paercebal
I haven't profiled this, so I can't say for sure, but I would expect the cost of lexical_cast to come from two sources: 1) the need for string copying (std::stringstream maintains its own dynamically allocated buffer), and 2) the need for dynamic allocations (which are far slower in C/C++ than in managed languages)The IOStreams (including stringstream) has a lot of flaws. `lexical_cast` just wraps it, so there's little it can do to improve the situation *in general*.But how many `lexical_casts` are you performing? Does the performance really *matter*?
jalf
@jaif : "But how many lexical_casts are you performing? Does the performance really matter" : We do agree on this. ^_^ ...
paercebal
@paercebal: +1 for the excellent specialisation idea. And sure, people often worry about performance when they don't need to. But I question your statement that increased generality usually leads to decreased runtime perf -- that was true before templates arrived, but in cases where all the relevant info is available at compile time (as here) that is no longer true. (A well-known example: std::sort() with a functor argument is more general than *and at least as fast as* qsort() due to inlining.) I think the problem here is not generality itself, but (avoidable) inefficiencies in IOstreams.
j_random_hacker
@j_random_hacker : An counter example would be the existence of the EASTL library, used instead of the STL, to take into account the constraints of game development. Another example would be using a house class instead of a std::vector when you know that your class will at most contain, say, 16 items, your house class allocating the array on the stack (i.e. "T t[16] ;" instead of "T * t ;" If this array happens to be a performance bottleneck, your house class will outperform std::vector because it won't use new/delete... ^_^ ...
paercebal
I have to somewhat disagree. Yes, `lexical_cast` *is* slower than necessary. This has got **nothing** to do with its flexibility and extensibility. Rather, it’s due to the unfortunate fact that the string stream needs to be copied to a string rather than passing ownership of its content over to a string object. This is unfortunate, unnecessary and could be done better. It *is* arguably bad design – not necessarily by the Boost guys but rather by those designing the `stringstream` class without the option to relinquish ownership of its internal character buffer, arguably a poor choice indeed.
Konrad Rudolph
thanks a lot, we are using this in a project. but instead of `%f` we use `%.20g`, and a buffer size of 30. A good example why this is required is when you try to cast `-boost::numeric::bounds<double>::highest()`.On one particular machine the lexical cast is about 100 times slower when used in multithreadding (I do not know why).
martinus
@martinus: I heard there was something likes locales that are mutexed global variables. I never went into the code, but if this is true, it could explain the degradation of performance when using different iostream objects from different threads, and still have each use wait for another to end before beginning. This would be a classical "convoy" pattern.
paercebal
+5  A: 

What lexical cast is doing in your code can be simplified to this:

string Cast( int i ) {
    ostringstream os << i;
    return os.str();
}

There is unfortunately a lot going on every time you call Cast():

  • a string stream is created possibly allocating memory
  • operator << for integer i is called
  • the result is stored in the stream, possibly allocating memory
  • a string copy is taken from the stream
  • a copy of the string is (possibly) created to be returned.
  • memory is deallocated

Thn in your own code:

 s = Cast( i );

the assignment involves further allocations and deallocations are performed. You may be able to reduce this slightly by using:

string s = Cast( i );

instead.

However, if performance is really importanrt to you, you should considerv using a different mechanism. You could write your own version of Cast() which (for example) creates a static stringstream. Such a version would not be thread safe, but that might not matter for your specific needs.

To summarise, lexical_cast is a convenient and useful feature, but such convenience comes (as it always must) with trade-offs in other areas.

anon
+1, a static ostringstream is a good idea. It may be possible to squeeze even more performance out by deriving a new stream buffer class from stringbuf that uses fixed-size, preallocated memory, and supplying that as the ostringstream's buffer with rdbuf() -- but then you'd be making some work for yourself ;)
j_random_hacker
A: 

if speed is a concern, or you are just interested in how fast such casts can be with C++, there's an interested thread regarding it.

Boost.Spirit 2.1(which is to be released with Boost 1.40) seems to be very fast, even faster than the C equivalents(strtol(), atoi() etc. ).

t.g.
A: 

I believe that there is a design flaw somewhere if you have to convert an integer to a string. Can you please provide more insight into what situation you have that needs this conversion?

Peter Jansson
Here's an example: I'd like to add 2 numbers and write the result in a text box on a GUI window.
j_random_hacker
It is difficult to see how speed performance on converting an integer to a string could be a bottleneck in a situation where you want to write something to a GUI. It should be more than enough to write something on a GUI with a frequency of about 100 Hz. (We humans can only appreciate about 25 Hz.)
Peter Jansson
How about saving some numbers to a file which is based on XML?
macbirdie
Any I/O to a formatted file would usually be a bottleneck. I'd say skip that I/O in your design to begin with, if possible.
Peter Jansson
+3  A: 

Unfortunately I don't have enough rep yet to comment...

lexical_cast is not primarily slow because it's generic (template lookups happen at compile-time, so virtual function calls or other lookups/dereferences aren't necessary). lexical_cast is, in my opinion, slow, because it builds on C++ iostreams, which are primarily intended for streaming operations and not single conversions, and because lexical_cast must check for and convert iostream error signals. Thus:

  • a stream object has to be created and destroyed
  • in the string output case above, note that C++ compilers have a hard time avoiding buffer copies (an alternative is to format directly to the output buffer, like sprintf does, though sprintf won't safely handle buffer overruns)
  • lexical_cast has to check for stringstream errors (ss.fail()) in order to throw exceptions on conversion failures

lexical_cast is nice because (IMO) exceptions allow trapping all errors without extra effort and because it has a uniform prototype. I don't personally see why either of these properties necessitate slow operation (when no conversion errors occur), though I don't know of such C++ functions which are fast (possibly Spirit or boost::xpressive?).

Edit: I just found a message mentioning the use of BOOST_LEXICAL_CAST_ASSUME_C_LOCALE to enable an "itoa" optimisation: http://old.nabble.com/lexical_cast-optimization-td20817583.html. There's also a linked article with a bit more detail.

dhardy