I have an application with multiple threads processing work from a todo queue. I have no influence over what gets into the queue or in what order (it is fed externally by the user). A single work item from the queue may take anywhere between a couple of seconds and several hours of runtime and should not be interrupted while processing. Also, a single work item may consume between a couple of megabytes and around 2GB of memory.

The memory consumption is my problem. I'm running as a 64-bit process on an 8GB machine with 8 parallel threads. If each of them hits a worst-case work item at the same time, I run out of memory. I'm wondering about the best way to work around this.

  1. plan conservatively and run 4 threads only. The worst case shouldn't be a problem anymore, but we waste a lot of parallelism, making the average case a lot slower.
  2. make each thread check available memory (or rather the total memory allocated by all threads) before starting on a new item. Only start when more than 2GB of memory is left. Recheck periodically, hoping that other threads will finish their memory hogs and we may start eventually.
  3. try to predict how much memory items from the queue will need (hard) and plan accordingly. We could reorder the queue (overriding user choice) or simply adjust the number of running worker threads.
  4. more ideas?

I'm currently tending towards number 2 because it seems simple to implement and solves most cases (a rough sketch of what I mean follows below). However, I'm still wondering what standard ways of handling situations like this exist? The operating system must do something very similar on a process level, after all...
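Roughly what I have in mind for number 2, as an untested sketch (MemoryBudget, estimate_bytes and process_item are placeholder names, not code from our application): workers reserve an estimated amount against a shared budget before starting an item and block until enough has been released.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>

// Shared memory budget: a worker reserves an (estimated) amount before it
// starts an item and releases it when done. Workers that cannot reserve
// block until other items finish. Note that an item whose estimate exceeds
// the total budget would never be admitted - that case needs special handling.
class MemoryBudget {
public:
    explicit MemoryBudget(std::size_t total_bytes) : available_(total_bytes) {}

    void reserve(std::size_t bytes) {
        std::unique_lock<std::mutex> lock(mutex_);
        freed_.wait(lock, [&] { return bytes <= available_; });
        available_ -= bytes;
    }

    void release(std::size_t bytes) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            available_ += bytes;
        }
        freed_.notify_all();
    }

private:
    std::mutex mutex_;
    std::condition_variable freed_;
    std::size_t available_;
};

// Worker loop, roughly:
//   budget.reserve(estimate_bytes(item));   // worst case: 2GB
//   process_item(item);
//   budget.release(estimate_bytes(item));
```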

regards,

Sören
A: 

Difficult to propose solutions without knowing exactly what you're doing, but how about considering:

  1. See if your processing algorithm can access the data in smaller sections without loading the whole work item into memory.
  2. Consider developing a service-based solution so that the work is carried out by another process (possibly a web service). This way you could scale the solution to run over multiple servers, perhaps using a load balancer to distribute the work.
  3. Are you persisting the incoming work items to disk before processing them? If not, they probably should be anyway, particularly if it may be some time before the processor gets to them.
  4. Is the memory usage proportional to the size of the incoming work item, or otherwise easy to calculate? Knowing this would help to decide how to schedule processing.

Hope that helps?!

Tim Croydon
1) Working on it; it's a complex problem due to the random access patterns. 2) We already have several such boxes running. The described problem is actually our smallest work unit. 3) They come from a database, so yes, they are persisted. 4) It does depend on the item, but the relationship is a bit involved and non-linear. Will look into it some more though. Thanks!
BuschnicK
A: 

So your current worst-case memory usage is 16GB. With only 8GB of RAM, you'd be lucky to have 6 or 7GB left after the OS and system processes take their share. So on average you're already going to be thrashing memory on a moderately loaded system. How many cores does the machine have? Do you have 8 worker threads because it is an 8-core machine?

Basically you can either reduce memory consumption, or increase available memory. Your option 1, running only 4 threads, under-utilises the CPU resources, which could halve your throughput - definitely sub-optimal.

Option 2 is possible, but risky. Memory management is very complex, and querying for available memory is no guarantee that you will be able to go ahead and allocate that amount (without causing paging). A burst of disk I/O could cause the system to grow its cache, a background process could start up and swap in its working set, and any number of other factors could intervene. For these reasons, the smaller the available memory, the less you can rely on the figure. Over time, memory fragmentation can cause problems too.
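For what it's worth, the query itself is simple (a Linux sketch below; Windows has GlobalMemoryStatusEx) - it's relying on the number that is the risky part:

```cpp
#include <sys/sysinfo.h>
#include <cstdint>

// The OS's current estimate of free physical memory, in bytes. Treat it
// strictly as a hint: it can change before you get around to allocating,
// and it says nothing about address-space fragmentation in your process.
std::uint64_t FreePhysicalBytes() {
    struct sysinfo info;
    if (sysinfo(&info) != 0) return 0;
    return static_cast<std::uint64_t>(info.freeram) * info.mem_unit;
}
```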

Option 3 is interesting, but could easily lead to under-loading the CPU. If you have a run of jobs that have high memory requirements, you could end up running only a few threads, and be in the same situation as option 1, where you are under-loading the cores.

So taking the "reduce consumption" strategy, do you actually need to have the entire data set in memory at once? Depending on the algorithm and the data access pattern (eg. random versus sequential) you could progressively load the data. More esoteric approaches might involve compression, depending on your data and the algorithm (but really, it's probably a waste of effort).
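If the raw data can be staged to a file (an assumption - you may be reading straight from the database), one way to get progressive loading even with random access is to memory-map the item and let the OS page it in on demand and evict it under pressure. A rough POSIX sketch:

```cpp
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a work item's data file read-only instead of loading it into a buffer.
// The OS brings in only the pages the algorithm actually touches and can
// drop them under memory pressure, so random access keeps working - it just
// gets slower rather than exhausting RAM. Error handling kept minimal.
const char* MapWorkItem(const char* path, std::size_t* size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return nullptr; }

    void* data = mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping keeps the file contents accessible
    if (data == MAP_FAILED) return nullptr;

    *size_out = static_cast<std::size_t>(st.st_size);
    return static_cast<const char*>(data);  // munmap(data, size) when finished
}
```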

Then there's "increase available memory". In terms of price/performance, you should seriously consider simply purchasing more RAM. Sometimes, investing in more hardware is cheaper than the development time to achieve the same end result. For example, you could put in 32GB of RAM for a few hundred dollars, and this would immediately improve performance without adding any complexity to the solution. With the performance pressure off, you could profile the application to see just where you can make the software more efficient.

gavinb
I absolutely agree on the "reduce consumption" point - we are working very hard on this and counting individual bytes already. Sequential/lazy loading is something we want to do, but it is very hard because of the random access patterns. It is in the pipeline but far off for now... It is an 8-core machine, yes. I'll ask about buying more memory, but it's not just one machine but lots of them - we parallelize across boxes as well.
BuschnicK
If you described the nature of your processing algorithm, it would be easier to offer more relevant suggestions. One thing that occurred to me is that you might be able to do an initial pass over your data to build an index, then use this information to optimize data loading and access strategies during processing.
gavinb
The data consists of disassembled pairs of binaries, i.e. callgraphs and flowgraphs. We are doing structural comparisons between these to measure similarity. Graph matching is an NP-hard problem, hence the complex access patterns on the data. Basically, I'm trying to optimize throughput for the diffing back end of vxclass (http://www.zynamics.com/vxclass.html).
BuschnicK
A very interesting problem. Given the algorithmic complexity of the problem, it really sounds like investing a few hundred dollars (or equivalent) in additional RAM to allow the machines to run without swapping and still keep all the cores busy, is the most cost-effective route. It will almost certainly be cheaper than weeks of developer time attempting to optimise the system, with no guarantee of success. Of course, these difficult problems are what keeps us going, so I'm sure you want to pursue optimizing the algorithm too!
gavinb
A: 

I have continued the discussion on Herb Sutter's blog and it prompted some very helpful reader comments. Head over to Sutter's Mill if you are interested.

Thanks for all the suggestions so far!

Sören
BuschnicK