I have a program that loads a file (anywhere from 10MB to 5GB) a chunk at a time (ReadFile), and for each chunk performs a set of mathematical operations (basically calculates the hash).

After calculating the hash, it stores info about the chunk in an STL map (basically <chunkID, hash>) and then writes the chunk itself to another file (WriteFile).

That's all it does. This program will cause certain PCs to choke and die. The mouse begins to stutter, the task manager takes > 2 min to show, ctrl+alt+del is unresponsive, running programs are slow.... the works.

I've done literally everything I can think of to optimize the program, and have triple-checked all objects.

What I've done:

  • Tried different (less intensive) hashing algorithms.
  • Switched all allocations to nedmalloc instead of the default new operator
  • Switched from std::map to unordered_set, found the performance to still be abysmal, so I switched again to Google's dense_hash_map.
  • Converted all objects to store pointers to objects instead of the objects themselves.
  • Caching all Read and Write operations. Instead of reading a 16k chunk of the file and performing the math on it, I read 4MB into a buffer and read 16k chunks from there instead. Same for all write operations - they are coalesced into 4MB blocks before being written to disk.
  • Run extensive profiling with Visual Studio 2010, AMD Code Analyst, and perfmon.
  • Set the thread priority to THREAD_MODE_BACKGROUND_BEGIN
  • Set the thread priority to THREAD_PRIORITY_IDLE
  • Added a Sleep(100) call after every loop.

Even after all this, the application still results in a system-wide hang on certain machines under certain circumstances.

Perfmon and Process Explorer show minimal CPU usage (with the sleep), no constant reads/writes from disk, few hard pagefaults (and only ~30k pagefaults in the lifetime of the application on a 5GB input file), little virtual memory (never more than 150MB), no leaked handles, no memory leaks.

The machines I've tested it on run Windows XP - Windows 7, x86 and x64 versions included. None have less than 2GB RAM, though the problem is always exacerbated under lower memory conditions.

I'm at a loss as to what to do next. I don't know what's causing it - I'm torn between CPU and memory as the culprit. CPU, because without the sleep and under different thread priorities the system's performance changes noticeably. Memory, because there's a huge difference in how often the issue occurs when using unordered_set vs Google's dense_hash_map.

What's really weird? Obviously, the NT kernel design is supposed to prevent this sort of behavior from ever occurring (a user-mode application driving the system into this kind of extreme performance degradation!?)... but when I compile the code and run it on OS X or Linux (it's fairly standard C++ throughout) it performs excellently, even on poor machines with little RAM and weaker CPUs.

What am I supposed to do next? How do I know what the hell it is that Windows is doing behind the scenes that's killing system performance, when all the indicators are that the application itself isn't doing anything extreme?

Any advice would be most welcome.

+1  A: 

Some things to check:

  • Antivirus software. These often scan files as they're opened to check for viruses. Is your delay occurring before any data is read by the application?
  • General system performance. Does copying the file using Explorer also show this problem?
  • Your code. Break it down into the various stages. Write a program that just reads the file, then one that reads and writes the files, then one that just hashes random blocks of RAM (i.e. remove the disk I/O part) and see if any particular step is problematic. If you can get a profiler, use it as well to see if there are any slow spots in your code.
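The "hash random blocks of RAM" stage from the list above might look like the following sketch: time the pure computation with no disk I/O at all, so it can be compared against the read-only and read+write variants. The function name and the XOR "hash" are placeholders, not the poster's actual code.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hashes `totalBytes` worth of in-memory data in `chunkSize` pieces and
// returns the elapsed wall-clock time in seconds.
double hashOnlySeconds(std::size_t totalBytes, std::size_t chunkSize) {
    std::vector<std::uint8_t> block(chunkSize);
    std::mt19937 rng(12345);
    for (auto& b : block) b = static_cast<std::uint8_t>(rng());

    auto start = std::chrono::steady_clock::now();
    volatile std::uint64_t sink = 0;  // keep the work from being optimized out
    for (std::size_t done = 0; done < totalBytes; done += chunkSize) {
        std::uint64_t h = 0;
        for (std::uint8_t b : block) h = (h << 5) ^ (h >> 3) ^ b;
        sink = sink ^ h;
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

Comparing this number against the same loop with real reads, and again with real writes, shows which stage drags the system down.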

EDIT

More ideas. Perhaps your program is holding on to the GDI lock too much. This would explain everything else being slow without high CPU usage. Only one app at a time can have the GDI lock. Is this a GUI app, or just a simple console app?

You also mentioned RtlEnterCriticalSection. This is a costly operation, and can hang the system quite easily, i.e. mismatched Enters and Leaves. Are you multi-threading at all? Is the slow down due to race conditions between threads?

Skizz
It's actually a library - it has a test GUI application and is integrated into a service for real-world use. Both result in general system unresponsiveness. The multithreading is clean; the problem also occurred when it was implemented as a single-threaded (CriticalSection-free) library. Contention between the threads results in slowdown of *my application* but not the OS. Critical sections are user-mode (not kernel-mode) objects, and will not affect *system* performance.
Computer Guru
+3  A: 

I know you said you had monitored memory usage and that it seems minimal here, but the symptoms sound very much like the OS thrashing like crazy, which would definitely cause general loss of OS responsiveness like you're seeing.

When you run the application on a file say 1/4 to 1/2 the size of available physical memory, does it seem to work better?

What I suspect may be happening is that Windows is "helpfully" caching your disk reads into memory and not giving up that cache memory to your application for use, forcing it to go to swap. Thus, even though swap use is minimal (150MB), it's going in and out constantly as you calculate the hash. This then brings the system to its knees.

Mark B
This has a gut feel of being in the neighborhood.
Paul Nathan
That's exactly what I first thought, and hence my mention of page faults. I have no way of knowing if this is truly the case, though. I'm already caching disk reads/writes myself, and eating through the buffers slowly enough that Windows shouldn't need to (or so one would think!). Smaller files mean fewer chunks, which mean fewer entries in the map, which mean fewer calculations, faster finds, an emptier CPU cache, etc. Basically, just because the problem does not happen with smaller files doesn't mean it's the OS swapping - do you know how to find out?
Computer Guru
+1 Look into this. Back when I used Windows as my day-to-day OS I would constantly have issues with its disk caching when working with large data files. Very often it would page my applications out to virtual memory in order to use the RAM as disk cache. I believe there are some tweaks to adjust this but I don't recall what they were, sorry - I was using WinXP.
Mark
One question: you never mentioned (or I missed it) what CPU usage was without the sleep - was the CPU pegged? If so, then that implies my theory is incorrect. If it's 5-10%, that smells of I/O blocking and almost certainly isn't the hashing or map manipulation directly. Compared to the disk I/O, I would guess the hash/map work should be relatively small, so there should be certain file sizes for which your program completes as quickly as a copy. After a certain size threshold two things should happen: page faults drastically increase and system performance decreases.
Mark B
Mark B, CPU usage for initial backups is at 30% (the process is multi-threaded, so that's a real 30%) and for incremental backups is > 90%. Not much room for IO overhead?
Computer Guru
That sure doesn't leave much room for I/O. As a test, are you able to just disable the write-back-to-disk portion of the code (comment it out?) and see if it behaves better? That would help us determine at least if the write cache is getting overwhelmed.
Mark B
If your thread priorities are as listed in the OP, then the activity should easily be getting bumped for I/O work or by most other processes when needed. However, if Windows has paged other processes' memory out to the pagefile, all the CPU time in the world won't improve the responsiveness. The fact that you have the issue with sleeps inserted makes me think it's not a CPU issue at all but a memory issue.
Mark
A: 

It sounds like you're poking around fixing things without knowing what the problem is. Take stackshots. They will tell you what your program is doing when the problem occurs. It might not be easy to get the stackshots if the problem occurs on other machines where you cannot use an IDE or a stack sampler. One possibility is to kill the app and get a stack dump when it's acting up. You need to reproduce the problem in an environment where you can get a stack dump.


Added: You say it performs well on OSX and Linux, and poorly on Windows. I assume the ratio of completion time is some fairly large number, like 10 or 100, if you've even had the patience to wait for it. I said this in the comment, but it is a key point. The program is waiting for something, and you need to find out what. It could be any of the things people mentioned, but it is not random.

Every program, all the time while it runs, has a call stack consisting of a hierarchy of call instructions at specific addresses. If at a point in time it is calculating, the last instruction on the stack is a non-call instruction. If it is in I/O the stack may reach into a few levels of library calls that you can't see into. That's OK. Every call instruction on the stack is waiting. It is waiting for the work it requested to finish. If you look at the call stack, and look at where the call instructions are in your code, you will know what your program is waiting for.

Your program, since it is taking so long to complete, is spending nearly all of its time waiting for something to finish, and as I said, that's what you need to find out. Get a stack dump while it's being slow, and it will give you the answer. The chance that it will miss it is 1/the-slowness-ratio.

Sorry to be so elemental about this, but lots of people (and profiler makers) don't get it. They think they have to measure.

Mike Dunlavey
I don't think this'll help. As I mentioned, I've profiled extensively under VS2010 and CodeAnalyst, both of which take "stackshots" as you call them. They show exactly what I expect: the CPU spends most of its time calculating the hash, some time finding entries in the STL map, some time entering critical sections, and some time reading/writing. All the methods people are describing here analyze *application* slowdown. But I know why *my* program is slow; I want to know why it's slowing down the *system*.
Computer Guru
@Computer-Guru: Stackshots are different. VS2010 (last time I looked) turns off sampling during I/O, so if the problem happens during I/O *it won't see it*. CA looks probably similar. Your problem (I bet) is I/O, so where the CPU spends time is irrelevant. What you want to know is, when it is stuck, which specific lines of code are on the call stack waiting for it to finish what it is having trouble finishing. If your app takes much longer to finish, then 1 sample should show why, even if you think the problem is elsewhere.
Mike Dunlavey
But, Mike, you say "when it is stuck" except the app *doesn't* get stuck - Windows does. My problem is that I don't know how to measure cause and effect when the cause is in the app and the effect is on Windows.
Computer Guru
@Computer-Guru: I understand your frustration, but I've done lots of cases like this, and the idea of "measuring" gets in the way. When you're not having this problem the program completes in some amount of time, like say, 1 minute. When you are having the problem, it completes in, say, 60 minutes. That means for 59 out of every 60 wall-clock milliseconds, it is hanging waiting for something it asked for to finish. So a single stack sample (at a random wall-clock time, not CPU time) will show in detail what in your program is driving Windows crazy, with only a 1/60 chance of missing it. Help?
Mike Dunlavey
... In other words, the program is always waiting for something. (If it weren't, it would be finished.) We know its actual CPU work takes 1 minute. So in the remaining 59 minutes, *what the heck is it waiting for?*. Don't bend your mind about measuring, just ask that simple question, with a single stack sample.
Mike Dunlavey
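Mike's 1/slowness-ratio point can be quantified. If the program spends a fraction p of wall-clock time in the mystery wait (p = 59/60 in his example), the chance that n random wall-clock stack samples ALL miss that wait is (1 - p)^n. A trivial helper (illustrative, not from the thread):

```cpp
#include <cmath>

// Probability that `samples` independent random wall-clock stack samples
// all land outside a wait occupying `waitingFraction` of total run time.
double missProbability(double waitingFraction, int samples) {
    return std::pow(1.0 - waitingFraction, samples);
}
```

With p = 59/60, one sample misses with probability about 1.7%, and three samples miss with probability under one in 200,000, which is why a handful of samples is enough.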
+1  A: 

XPerf is your guide here - watch the PDC Video about it, and then take a trace of the misbehaving app. It will tell you exactly what's happening throughout the system, it is extremely powerful.

Paul Betts
+1  A: 

I like the disk-caching/thrashing suggestions, but if that's not it, here are some scattershot suggestions:

What non-MSVC libraries, if any, are you linking to?

Can your program be modified (#ifdef'd) to run without a GUI? Does the problem occur?

You added ::Sleep(100) after each loop in each thread, right? How many threads are you talking about? A handful or hundreds? How long does each loop take, roughly? What happens if you make that ::Sleep(10000)?

Is your program perhaps doing something else that locks a limited resource? (ProcExp can show you what handles are being acquired ... of course you might have difficulty with ProcExp not responding :-[)

Are you sure CriticalSections are userland-only? I recall that was so back when I worked on Windows (or so I believed), but Microsoft could have modified that. I don't see any guarantee in the MSDN article Critical Section Objects (http://msdn.microsoft.com/en-us/library/ms682530%28VS.85%29.aspx) ... and this leads me to wonder: Anti-convoy locks in Windows Server 2003 SP1 and Windows Vista

Hmmm... presumably we're all multi-processor now, so are you setting the spin count on the CS?

How about running a debugging version of one of these OSes and monitoring the kernel debugging output (using DbgView)... possibly using the kernel debugger from the Platform SDK ... if MS still calls it that?

I wonder whether VMMap (another SysInternal/MS utility) might help with the Disk caching hypothesis.

LarryW
+1  A: 

It turns out that this is a bug in the Visual Studio compiler. Using a different compiler resolves the issue entirely.

In my case, I installed and used the Intel C++ Compiler, and even with all optimizations disabled I did not see the full-system hang that I was experiencing with the Visual Studio 2005-2010 compilers on this library.

I'm not certain as to what is causing the compiler to generate such broken code, but it looks like we'll be buying a copy of the Intel compiler.

Computer Guru