views:

350

answers:

8

I have a C++ program that will read in data from a binary file and originally I stored data in std::vector<char*> data. I have changed my code so that I am now using strings instead of char*, so that std::vector<std::string> data. Some changes I had to make was to change from strcmp to compare for example.

However I have seen my execution time dramatically increase. For a sample file, when I used char* it took 0.38s and after the conversion to string it took 1.72s on my Linux machine. I observed a similar problem on my Windows machine with execution time increasing from 0.59s to 1.05s.

I believe this function is causing the slow down. It is part of the converter class, note private variables designated with_ at the end of variable name. I clearly am having memory problems here and stuck in between C and C++ code. I want this to be C++ code, so I updated the code at the bottom.

I access ids_ and names_ many times in another function too, so access speed is very important. Through the use of creating a map instead of two separate vectors, I have been able to achieve faster speeds with more stable C++ code. Thanks to everyone!

Example NewList.Txt

2515    ABC 23.5    32  -99 1875.7  1  
1676    XYZ 12.5    31  -97 530.82  2  
279  FOO 45.5    31  -96  530.8  3  

OLD Code:

void converter::updateNewList(){
    FILE* NewList;
    char lineBuffer[100];
    char* id = 0;
    char* name = 0;

    int l = 0;
    int n;

    NewList = fopen("NewList.txt","r");
    if (NewList == NULL){
        std::cerr << "Error in reading NewList.txt\n";
        exit(EXIT_FAILURE);
    } 

    while(!feof(NewList)){
        fgets (lineBuffer , 100 , NewList); // Read line    
        l = 0;
        while (!isspace(lineBuffer[l])){
            l = l + 1;
        }

        id = new char[l];
        switch (l){
            case 1: 
                n = sprintf (id, "%c", lineBuffer[0]);
                break;
            case 2:
                n = sprintf (id, "%c%c", lineBuffer[0], lineBuffer[1]);
                break;
            case 3:
                n = sprintf (id, "%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2]);        
                break;
            case 4:
                n = sprintf (id, "%c%c%c%c", lineBuffer[0], lineBuffer[1], lineBuffer[2],lineBuffer[3]);
                break;
            default:
                n = -1;
                break;
        }
        if (n < 0){
            std::cerr << "Error in processing ids from NewList.txt\n";
            exit(EXIT_FAILURE);
        }

        l = l + 1;
        int s = l;
        while (!isspace(lineBuffer[l])){
            l = l + 1;
        }
        name = new char[l-s];
        switch (l-s){
            case 2:
                n = sprintf (name, "%c%c", lineBuffer[s+0], lineBuffer[s+1]);
                break;
            case 3:
                n = sprintf (name, "%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2]);
                break;
            case 4:
                n = sprintf (name, "%c%c%c%c", lineBuffer[s+0], lineBuffer[s+1], lineBuffer[s+2],lineBuffer[s+3]);
                break;
            default:
                n = -1;
                break;
        }
        if (n < 0){
            std::cerr << "Error in processing short name from NewList.txt\n";
            exit(EXIT_FAILURE);
        }


        ids_.push_back ( std::string(id) );
        names_.push_back(std::string(name));
    }

    bool isFound = false;
    for (unsigned int i = 0; i < siteNames_.size(); i ++) {
        isFound = false;
        for (unsigned int j = 0; j < names_.size(); j ++) {
            if (siteNames_[i].compare(names_[j]) == 0){
                isFound = true;
            }
        }
    }

    fclose(NewList);
    delete [] id;
    delete [] name;
}

C++ CODE

void converter::updateNewList(){
    std::ifstream NewList ("NewList.txt");

    while(NewList.good()){
        unsigned int id (0);
        std::string name;

        // get the ID and name
        NewList >> id >> name;

        // ignore the rest of the line
        NewList.ignore( std::numeric_limits<std::streamsize>::max(), '\n');

        info_.insert(std::pair<std::string, unsigned int>(name,id));

    }

    NewList.close();
}

UPDATE: Follow up question: Bottleneck from comparing strings and thanks for the very useful help! I will not be making these mistakes in the future!

+2  A: 

Resize vector to large enough size before you start populating it. Or, use pointers to strings instead of strings.

The thing is that the strings are being copied each time the vector is being auto-resized. For small objects such as pointers this cost nearly nothing, but for strings the whole string is copied in full.

And id and name should be string instead of char*, and be initialized like this (assuming that you still use string instead of string*):

id = string(lineBuffer, lineBuffer + l);
...
name = string(lineBuffer + s, lineBuffer + s + l);
...
ids_.push_back(id);
names_.push_back(name);
Dialecticus
Vector was there prior to adding string. He asks why strings are slower. using pointer to strings is not that boost either, because strings use shallow copy.
Andrey
well, not sure about is string copy deep or shallow
Andrey
My bad. Now that OP added code I see plenty of room for improvement and reducing memory leaks.
Dialecticus
@Dialectius and how would I get these improvements?
Elpezmuerto
I added some code. Instead of new char[], switch and sprintf just create a string. And when you omit new char[] you will also omit memory leak that current code exhibits. And it might still be true that using pointer to string in vector instead of string would speed things up.
Dialecticus
+1  A: 

Before fixing something make sure that it is bottleneck. Otherwise you are wasting your time. Plus this sort of optimization is microoptimization. If you are doing microoptimization in C++ then consider using bare C.

Andrey
and in case you decide to use bare C, http://stackoverflow.com/questions/2371292/buffered-reading-from-stdin-using-fread-in-c might help.
N 1.1
No need to go "all C". Having a `std::vector<char *>` is a good optimization, if this is really the bottleneck.
paercebal
@paercebal i really doubt that it can be bottleneck
Andrey
@Andrey: My point was more about the "all C" way. If the suspensions of your shiny car is not efficient enough, no need to change the car. Just change the suspension. In C++, just find the code that is the bottleneck, wrap it with a safe interface, and hide inside the less safe but more efficient code. As an e.g., I coded a component whose one struct was a bottleneck. The struct had a std::string and an int, and it was copied a lot. I made it a class to hide inside a char[32] and an int, and the copy method used a memcpy. But for the rest of the component, it was still a full, safe C++ object.
paercebal
A: 

You can use a profiler to find out where your code consumes most time. If you are for example using gcc, you can compile your program with -pg. When you run it, it saves profiling results in a file. You can the run gprof on the binary to get human readable results. Once you know where most time is consumed you can post that piece of code for further questions.

Helmut
+1  A: 

Except for std::string, this is a C program.

Try using fstream, and use the profiler to detect the bottle neck.

VJo
+5  A: 

My guess it that it should be tied to the vector<string>'s performance

About the vector

A std::vector works with an internal contiguous array, meaning that once the array is full, it needs to create another, larger array, and copy the strings one by one, which means a copy-construction and a destruction of string which had the same contents, which is counter-productive...

To confirm this easily, then use a std::vector<std::string *> and see if there is a difference in performance.

If this is the case, they you can do one of those four things:

  1. if you know (or have a good idea) of the final size of the vector, use its method reserve() to reserve enough space in the internal array, to avoid useless reallocations.
  2. use a std::deque, which works almost like a vector
  3. use a std::list (which doesn't give you random access to its items)
  4. use the std::vector<char *>

About the string

Note: I'm assuming that your strings\char * are created once, and not modified (through a realloc, an append, etc.).

If the ideas above are not enough, then...

The allocation of the string object's internal buffer is similar to a malloc of a char *, so you should see little or no differences between the two.

Now, if your char * are in truth char[SOME_CONSTANT_SIZE], then you avoid the malloc (and thus, will go faster than a std::string).

Edit

After reading the updated code, I see the following problems.

  1. if ids_ and names_ are vectors, and if you have the slightest idea of the number of lines, then you should use reserve() on ids_ and and names_
  2. consider making ids_ and names_ deque, or lists.
  3. faaNames_ should be a std::map, or even a std::unordered_map (or whatever hash_map you have on your compiler). Your search currently is two for loops, which is quite costly and inneficient.
  4. Consider comparing the length of the strings before comparing its contents. In C++, the length of a string (i.e. std::string::length()) is a zero cost operation)
  5. Now, I don't know what you're doing with the isFound variable, but if you need to find only ONE true equality, then I guess you should work on the algorithm (I don't know if there is already one, see http://www.cplusplus.com/reference/algorithm/), but I believe this search could be made a lot more efficient just by thinking on it.

Other comments:

  1. Forget the use of int for sizes and lengths in STL. At very least, use size_t. In 64-bit, size_t will become 64-bit, while int will remain 32-bits, so your code is not 64-bit ready (in the other hand, I see few cases of incoming 8 Go strings... but still, better be correct...)

Edit 2

The two (so called C and C++) codes are different. The "C code" expects ids and names of length lesser than 5, or the program exists with an error. The "C++ code" has no such limitation. Still, this limitation is ground for massive optimization, if you confirm names and ids are always less then 5 characters.

paercebal
That is assuming that this issue is with the old C++ standard, not the C++0x and a compiler that already implemented move semantics.
Let_Me_Be
+1 for answering the question. And yes, the extra malloc/free are probably the source of his slow down. Toss in the memcpy required to initialize/copy the strings for good measure.
Torlack
Let_Me_Be: There is still the cost of creating the initial string to be placed into the vector.
Torlack
@Torlack: The initial string does not need to cost anything: In VC++ 2008 STL implementation, any string less than 16 (or 15?) chars is directly embedded in the string class, not allocated with `new`. Meaning that such string (and most "trivial" strings are within this size) cost as much as a stack char[15].
paercebal
@Torlack: Fact is a "C initial `char *`" would be as costly to create and put in a vector than a "C++0x initial `std::string`". Still, I fail to see how a memcpy would help here, as C++ object are usually not memcpy-able (this is the province of the copy constructor, which, truly, can be less effective), but as written by Let_Me_Be, the C++0x' `std::move` is the C++ equivalent of C's `memmove`, but working on C++ objects, and used in the code, can produce miracles.
paercebal
@Paercebal, you can assume that the name and ids are always less than 5 characters
Elpezmuerto
I am trying to use `std::vector<std::string *>` but failing. Why doesn't declaring `std::vector<std::string*> names_;` and then `names_.push_back(line.substr(0,l))` or `names_.push_back(*line.substr(0,l))` not work?
Elpezmuerto
@Elpezmuerto : Instead of `push_back(line(0, l))`, you should try instead `push_back(new std::string(line, 0, l))`. Note that after use, you'll have to free those strings.
paercebal
@Elpezmuerto : Anyway, my late tests demonstrated the C++ code I did retrieve was FASTER on a gcc 4.4.5 than the C counterpart, and that reserving the vectors decreased the execution time by ~20%. But the allocation time (new/delete) of `char *` in the "C code" seemed to have little or no consequences. But I have still to work on a real test implementation (your "C++ code" was changed while I still worked on the version it replaced).
paercebal
@Paercebal, thanks a bunch! Ya I changed to using a `map` over `vector` because a the significant performance increase in a different function: Bottleneck from comparing strings (stackoverflow.com/q/3992548/363829)
Elpezmuerto
@Elpezmuerto: You're welcome... :-) ...
paercebal
+1  A: 

You can try to reserve a number of vector values in order to reduce the number of allocations (which are costly), as said Dialecticus (probably from the ancient Roma?).

But there is something that may deserve some observation: how do you store the strings from the file, do you perform concatenations etc...

In C, strings (which do not exist per say - they don't have a container from a library like the STL) need more work to deal with, but at least we know what happens clearly when dealing with them. In the STL, each convenient operation (meaning requiring less work from the programmer) may actually require a lot of operations behind the scene, within the string class, depending on how you use it.

So, while the allocations / freeings are a costly process, the rest of the logic, especially the strings process, may / should probably be looked at as well.

ring0
+1  A: 

I believe the main issue here is that your string version is copying things twice -- first into dynamically allocated char[] called name and id, and then into std::strings, while your vector<char *> version probably does not do that. To make the string version faster, you need to read directly into the strings and get rid of all the redundant copies

Chris Dodd
If I want to read directly into the string, I must use `ifstream`. I don't think I can use `fget` to do that?
Elpezmuerto
You can start with a blank string, use `.resize()` to set the size/capacity, and then write directly into the string with the overloaded `operator[]` if you want something that is as much like a `char` buffer as possible. Or you can create an output iterator that writes into a string and essentially treat the string as a bufffer you write into.
Chris Dodd
+1  A: 

streams take care of a lot of the heavy lifting for you. Stop doing it all yourself, and let the library help you:

void converter::updateNewList(){
    std::ifstream NewList ("NewList.txt");

    while(NewList.good()){
        int id (0);
        std::string name;

        // get the ID and name
        NewList >> id >> name;

        // ignore the rest of the line
        NewList.ignore( numeric_limits<streamsize>::max(), '\n');

        ids_.push_back (id);
        names_.push_back(name);
    }

    NewList.close();
}

There's no need to do the whitespace-tokenizing manually.

Also, you may find this site a helpful reference: http://www.cplusplus.com/reference/iostream/ifstream/

Tim
Edit: `NewList.ignore( std::numeric_limits<std::streamsize>::max(), '\n');`
Elpezmuerto
+1 For showing me the light, is there anyway to have ids_ and names_ as string pointers? I access them alot of that is the bottleneck is my code
Elpezmuerto
Note that I changed your IDS to ints. It's somewhat wasteful to store numbers as strings. As for the bottleneck of accessing the strings... can you post some of the code you use to do that? If you're doing unnecessary copying, that could be the problem.
Tim
@Tim: Bottleneck from comparing strings (http://stackoverflow.com/q/3992548/363829)
Elpezmuerto
From that other site about the search, yes a map is definitely the way to go. That would change this loading method, but not too much. Instead of `push_back` for both vectors, load the map once with `your_map[name] = id;` The rest of the logic in this loading method could stay the same.
Tim