I need to process a list of files. The processing action should not be repeated for the same file. The code I am using for this is -

#include <map>
#include <string>
#include <vector>

using namespace std;

struct File { string sFilename; /* ... */ };

void foo(File* pFile); //foo() is the action to do for each file

vector<File*> gInputFileList; //Can contain duplicates
map<string, File*> gProcessedFileList; //Using map to avoid linear search costs

void processFile(File* pFile)
{
    //operator[] default-constructs a NULL entry if the key is absent
    File* pProcessedFile = gProcessedFileList[pFile->sFilename];
    if(pProcessedFile != NULL)
        return; //Already processed

    foo(pFile);
    gProcessedFileList[pFile->sFilename] = pFile;
}

int main()
{
    size_t n = gInputFileList.size(); //Array syntax (iterator syntax gives identical performance)
    for(size_t i = 0; i < n; i++){
        processFile(gInputFileList[i]);
    }
}

The code works correctly, but...

My problem is that when the input size is 1000, it takes 30 minutes - HALF AN HOUR - on Windows/Visual Studio 2008 Express. For the same input, it takes only 40 seconds to run on Linux/gcc!

What could be the problem? The action foo() takes only a very short time to execute, when used separately. Should I be using something like vector::reserve for the map?

EDIT, EXTRA INFORMATION

What foo() does is:

  1. it opens the file
  2. reads it into memory
  3. closes the file
  4. the contents of the file in memory are parsed
  5. it builds a list of tokens; I'm using a vector for that (see the sketch below)
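In outline, foo() looks roughly like the sketch below. This is only an illustration of its shape, with made-up names, not the actual code:

#include <fstream>
#include <sstream>
#include <string>
#include <vector>

using namespace std;

void foo(File* pFile)
{
    ifstream in(pFile->sFilename.c_str()); // 1. open the file
    stringstream buffer;
    buffer << in.rdbuf();                  // 2. read it into memory
    in.close();                            // 3. close the file

    vector<string> tokens;                 // 5. the list of tokens
    //tokens.reserve(1024);                // a guessed capacity could cut reallocations
    istringstream parser(buffer.str());    // 4. parse the in-memory contents
    string token;
    while(parser >> token)
        tokens.push_back(token);           // the std::vector add seen in the call stack
}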

Whenever I break into the program (while running with the 1000+ file input set), the call stack shows that it is in the middle of a std::vector add.

+2  A: 

I very very strongly doubt that your performance problem is coming from the STL containers.

Try to eliminate (comment out) the call to foo(pFile) or any other method which touches the filesystem. Although running foo(pFile) once may appear fast, running it on 1000 different files (especially on Windows filesystems, in my experience) could turn out to be much slower (e.g. because of filesystem cache behaviour.)

EDIT

Your initial post was claiming that BOTH debug and release builds were affected. Now you are withdrawing that claim.

Be aware that in DEBUG builds:

  1. the STL implementation performs extra checks and assertions
  2. heap operations (memory allocation etc.) perform extra checks and assertions; moreover, under debug builds the low-fragmentation heap is disabled (up to a 10x overall slowdown in memory allocation)
  3. no code optimizations are performed, which may result in further STL performance degradation (the STL often relies heavily on inlining, loop unrolling, etc.)

With 1000 iterations you are probably not affected by the above (not at the outer loop level at least) unless you use STL/the heap heavily INSIDE foo().
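If you must measure with a debug build, the iterator checks can at least be switched off. A sketch for VC9 (this macro must be defined before any standard header is included, identically in every translation unit):

//define before including ANY standard header, in every translation unit
#define _HAS_ITERATOR_DEBUGGING 0 //turn off debug-build iterator debugging checks
#include <map>
#include <vector>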

vladr
The files that have been processed already are stored in the map, so that step is correct.
Dennis Zickefoose
No, gProcessedFileList[] is a cache of processed files to avoid reprocessing the same one repeatedly.
kibibu
I will check whether the files are getting closed.
sonofdelphi
1. At the time of the initial post, I had unknowingly checked the behavior with the debug build, thinking it was the release build. I am sorry about that. 2. foo() also uses the STL quite heavily, as I've indicated in the updated version of the question.
sonofdelphi
+2  A: 

Break into the program using the debugger at a random time, and the chances are very high that the stack trace will tell you where it's spending the time.

bk1e
I have updated this in the question. It is in the middle of a vector add.
sonofdelphi
+2  A: 

I would approach it like any performance problem. This means: profiling. MSVC has a built-in profiler, by the way, so it may be a good chance to get familiar with it.

Eli Bendersky
I don't think the Express edition has a built-in profiler. Can you suggest alternatives?
sonofdelphi
@sonofdelphi: try to compile the same program on Windows with mingw and use `gprof`
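Roughly like this, assuming MinGW's g++ is on the PATH (-pg adds gprof instrumentation, and running the instrumented program writes gmon.out):

g++ -pg -O2 main.cpp -o app.exe
app.exe
gprof app.exe gmon.out > profile.txt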
Eli Bendersky
+1  A: 

I would be astounded if the performance issues you are seeing have anything at all to do with the map class. Doing 1000 lookups and 1000 insertions should take a combined time on the order of microseconds. What is foo() doing?

Marcelo Cantos
I have updated the details of foo() in the question.
sonofdelphi
+1  A: 

Without knowing how the rest of the code fits in, I think the overall idea of caching processed files is a little flaky.

Try removing duplicates from your vector first, then process them all.
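For instance, a minimal sketch of one way to do that, assuming duplicates are identified by sFilename (the comparator names here are made up):

#include <algorithm>

bool lessByName(const File* a, const File* b)  { return a->sFilename <  b->sFilename; }
bool equalByName(const File* a, const File* b) { return a->sFilename == b->sFilename; }

//sort + unique + erase removes duplicates; note this reorders the list
sort(gInputFileList.begin(), gInputFileList.end(), lessByName);
gInputFileList.erase(
    unique(gInputFileList.begin(), gInputFileList.end(), equalByName),
    gInputFileList.end());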

kibibu
+1  A: 

Try commenting out each block or major operation to determine which part actually causes the difference in execution time between Linux and Windows. I also don't think it would be because of the STL map. The problem may be inside foo(). It may be some file operation, as that is the only thing I can think of that would be costly in this case.

You may insert clock() calls in between operations to get an idea of the execution time.
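For example, around the processFile() call in the question's loop (a sketch):

#include <cstdio>
#include <ctime>

clock_t start = clock();
processFile(gInputFileList[i]); //the operation being timed
clock_t stop = clock();
printf("processFile: %.3f s\n", double(stop - start) / CLOCKS_PER_SEC);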

jasonline
+1  A: 

You say that when you break, you find yourself inside vector::add. You don't have a vector::add in the code you've shown us, so I suspect it's inside the foo function. Without seeing that code, it's going to be difficult to say what's up.

You might have inadvertently created a Shlemiel the Painter algorithm.
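To illustrate the pattern (the classic example, not your code):

#include <cstring>

//The classic Shlemiel: each strcat() rescans dst from the beginning to find
//the terminating NUL, so appending n pieces costs O(n^2) character reads.
void buildLine(char* dst, const char* const* pieces, int n)
{
    dst[0] = '\0';
    for(int i = 0; i < n; ++i)
        strcat(dst, pieces[i]);
}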

Mark Ransom
+1 for the link, interesting reading
Jamie Cook
I have not put the foo() code in the post; have only explained what it does.
sonofdelphi
"Shlemiel the Painter" is just a new name for what's already known (to me) as O(n2). But really, big-oh of n squared should be a complexity class in and of itself in the same manner as off-by-one is bug class in and of itself, because it's so common.
wilhelmtell
+15  A: 

In Microsoft Visual Studio, there's a global lock when accessing the Standard C++ Library in Debug builds, to protect against multithreading issues. This can cause big performance hits. For instance, our full test code runs in 50 minutes on Linux/gcc, whereas it needs 5 hours on Windows VC++ 2008. Note that this performance hit does not exist when compiling in Release mode, using the non-debug Visual C++ runtime.

Didier Trosset
I think this is it. I generated a new Release build and tried it; the performance improved to 240 seconds. Thanks for sharing the experience. Is there a workaround for this Debug STL locking?
sonofdelphi
Look at http://blogs.msdn.com/vcblog/archive/2009/06/23/stl-performance.aspx and the comments.
Francesco
I would also add that even in release mode, iterators in VC do extra checking that GCC doesn't do. Google _SECURE_SCL for more info.
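For reference, a sketch of switching that off (the macro must be defined before any standard header, consistently across all translation units):

#define _SECURE_SCL 0 //disable VC's checked iterators (on by default even in release)
#include <vector>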
stonemetal
+1  A: 

You can improve things somewhat if you ditch your map and partition your vector instead. This implies reordering the input files list. It also means you have to find a way of quickly determining if a file has been processed already, possibly by holding a flag in the File class. If it's ok to reorder the files list and if you can store that dirty flag in the File object then you can improve performance from O(n log m) to O(n), for n total files and m processed files.

#include <algorithm>
#include <functional>
// ...
// move the not-yet-processed files to the front; 'end' marks the boundary
vector<File*>::iterator end(partition(inputfiles.begin(), inputfiles.end(),
                                      not1(mem_fun(&File::is_processed))));
for_each(inputfiles.begin(), end, processFile);

If you can't reorder the files list, or if you can't change the File object, then you can replace the map with a vector and shadow each file in the input files list with a flag in a second vector at the same index. This will cost you O(n) space but will give you an O(1) check for the dirty state.

vector<File*> processed(inputfiles.size(), 0);

for( vector<File*>::size_type i(0); i != inputfiles.size(); ++i ) {
    if( processed[i] != 0 ) continue; // already done: O(1) check
    // ... process inputfiles[i] ...
    processed[i] = inputfiles[i];     // mark as done: O(1)
}

But be careful: You're dealing with two distinct pointers pointing at the same address, and that's the case for each pair of pointers in the two containers. Make sure one and only one pointer owns the pointee.

I don't expect either of these to be the solution for that performance hit, but they're worth trying nevertheless.

wilhelmtell
A: 

If you are doing most of your work on Linux, then I strongly suggest you only ever compile in Release mode on Windows. That makes life much easier, especially considering all of Windows' inflexible library-handling headaches.

brian