I have to write a not-so-large program in C++, using boost::thread.

The problem at hand is to process a large number of (possibly) large files: maybe thousands or tens of thousands, and hundreds of thousands or even millions are a possibility as well. Each file is independent of the others, and they all reside in the same directory. I'm thinking of using a multi-threaded approach, but the question is, how many threads should I use? I mean, what order of magnitude? 10, 500, 12400?

There are some synchronization issues: each thread should return a struct of values (accumulated per file), and those are added to a "global" struct to get the overall data. I realize that some threads could "get hungry" (be starved) because of synchronization, but if it's only an add operation, does it matter?

I was thinking of

for (each file f in directory) {
    if (N < max_threads)    // N is a static variable controlling amount of threads
        thread_process(f);
    else
        sleep();
}
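
For concreteness, here is a rough, untested sketch of that loop with boost::thread, replacing the sleep() with a condition variable so the loop blocks until a slot frees up. The Stats struct, the body of thread_process() and the boost::filesystem directory walk are placeholders, not part of the plan above:

    #include <boost/thread.hpp>
    #include <boost/filesystem.hpp>

    struct Stats { long long lines; };           // per-file results (placeholder)

    Stats        global_stats = { 0 };
    boost::mutex stats_mutex;                    // guards global_stats

    boost::mutex              count_mutex;       // guards 'running'
    boost::condition_variable slot_free;
    int       running     = 0;
    const int max_threads = 8;                   // the number this question is about

    void thread_process(boost::filesystem::path p)
    {
        Stats local = { 0 };
        // ... read the file at p and accumulate into 'local' ...

        {   // fold the per-file result into the global struct
            boost::lock_guard<boost::mutex> lock(stats_mutex);
            global_stats.lines += local.lines;
        }
        {   // release our slot and wake the main loop
            boost::lock_guard<boost::mutex> lock(count_mutex);
            --running;
        }
        slot_free.notify_one();
    }

    int main()
    {
        namespace fs = boost::filesystem;
        for (fs::directory_iterator it("."), end; it != end; ++it)
        {
            boost::unique_lock<boost::mutex> lock(count_mutex);
            while (running >= max_threads)       // block instead of sleep()/polling
                slot_free.wait(lock);
            ++running;
            lock.unlock();
            boost::thread(thread_process, it->path()).detach();
        }
        // a real program would also wait here for the last workers to finish
        return 0;
    }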

This is on HP-UX, but I won't be able to test it often, since it's a remote and quite inaccessible server.

+1  A: 

How expensive the simplest thread is depends on the OS (you may also need to tune some OS parameters to get past a certain number of threads). At minimum each has its own CPU state (registers/flags incl. floating point) and stack as well as any thread-specific heap storage.

If each individual thread doesn't need too much distinct state, then you can probably get them pretty cheap by using a small stack size.
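
For what it's worth, newer Boost versions (1.50 and later) let you request a smaller per-thread stack through boost::thread::attributes; a minimal sketch, with the 64 KB figure chosen arbitrarily:

    #include <boost/thread.hpp>

    void work() { /* ... */ }

    int main()
    {
        boost::thread::attributes attrs;
        attrs.set_stack_size(64 * 1024);   // request a 64 KB stack (the OS may round up)
        boost::thread t(attrs, work);
        t.join();
        return 0;
    }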

In the limit, you may end up needing to use a non-OS cooperative threading mechanism, or even multiplex events yourself using tiny "execution context" objects.

Just start with threads and worry about it later :)

wrang-wrang
+11  A: 

I'm not too sure about HP/UX, but in the Windows world, we use thread pools to solve this sort of problem. Raymond Chen wrote about this a while back, in fact...

The skinny of it is that I would generally not expect anything to scale well on a CPU-bound load if the number of threads is more than about 2x the number of CPU cores you have in the system. For I/O bound loads, you might be able to get away with more, depending on how fast your disk subsystem is, but once you reach about 100 or so, I would seriously consider changing the model...

Dave Markle
Too many threads will also mess up your IO. Disk head thrashing and buffer contention will overcome the benefits of having lots of processes running. I would stick with the tried and tested "cpu cores * 2" formula.
James Anderson
@James, true, true. But don't we all have one of those new fangled Intel SSDs? Oh, that's right. Newegg jacked up the price on them, so I couldn't afford it :(
Dave Markle
I would also add there is another side effect of too many threads, at least on *nix systems. The storage allocated to the threads will at some point be stolen from the buffer pool, slowing down the I/O even further.
James Anderson
Yes - if you set your stack size to 2 megabytes, then *each thread* has a 2 megabyte stack. That can balloon your process memory usage, leading to more paging and fragmentation, even if the stacks aren't all being thoroughly used. Java applications can typically have more threads fit into a small process address space because everything is on the heap, but a C or C++ app may require a deeper stack due to variables/buffers with automatic storage. I'm not sure if this applies to Win32 or not.
Tom
+4  A: 

Not to sound trite, but you use as many threads as you need.

Basically you can draw a graph of the number of threads against the (real) time to completion. You can also draw one of the number of threads against the total thread time.

The first graph in particular will help you identify where the bottleneck in CPU power lies. At some point you will become either I/O bound (meaning the disk can't load the data fast enough) or the number of threads will become so large it will impact performance of the machine.

The latter does happen. I saw one piece of code that ended up creating 30,000+ threads. It ended up being quicker after capping it to 1,000.

The other way to look at this is: how fast is fast enough? The point where I/O becomes a bottleneck is one thing but you may hit a point before that where it's "fast enough".

cletus
The point about 'fast enough' is a good one. I've avoided writing caching mechanisms for example by good use of threads to calculate the different values in parallel. Would the cache be faster? Sure, but recalculating saved disk space, and still completed in seconds anyway, and for a once-off operation, that's 'fast enough'.
Matthew Scharley
+3  A: 

Use a thread pool instead of creating a thread for each file. You can easily adjust the number of threads once you write your solution. If the jobs are independent of each other, I'd say the number of threads should be equal to the number of cores/CPUs.
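
A minimal sketch of that idea, using boost::asio's io_service as the work queue and one thread per core; process_file() and the posted file names are placeholders:

    #include <boost/asio.hpp>
    #include <boost/bind.hpp>
    #include <boost/thread.hpp>
    #include <string>

    void process_file(const std::string& name) { /* ... per-file work ... */ }

    // small wrapper so io_service::run can be handed to boost::bind unambiguously
    void run_queue(boost::asio::io_service* q) { q->run(); }

    int main()
    {
        boost::asio::io_service queue;
        boost::thread_group     workers;

        {
            // 'keep_alive' stops run() from returning while jobs are still being posted
            boost::asio::io_service::work keep_alive(queue);

            unsigned n = boost::thread::hardware_concurrency();
            if (n == 0) n = 2;                     // hardware_concurrency() may return 0
            for (unsigned i = 0; i < n; ++i)
                workers.create_thread(boost::bind(run_queue, &queue));

            // one post per file; the number of threads never changes
            queue.post(boost::bind(process_file, std::string("file1.dat")));
            queue.post(boost::bind(process_file, std::string("file2.dat")));
        }   // keep_alive destroyed: the workers drain the queue and run() returns

        workers.join_all();
        return 0;
    }

The number of threads stays fixed while the number of queued jobs can be arbitrarily large, which is what makes the pool size easy to tune later.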

leiz
+3  A: 

The answer depends somewhat on how CPU intensive the processing you need to perform on each file is.

At one extreme where the processing time dominates the I/O time, the benefit that threading gives you is just the ability to take advantage of multiple cores (and possibly hyperthreading) to make use of the maximum available processing power of your CPU. In this case you'd want to aim for a number of worker threads roughly equal to the number of logical cores on the system.

At the other extreme, where I/O is your bottleneck, you aren't going to see all that much benefit from multiple threads, since they will spend most of their time sleeping while waiting for I/O to complete. In that case you'd want to focus on maximizing your I/O throughput rather than your CPU utilization. On a single unfragmented hard drive or a DVD where you were I/O bound, having multiple threads would likely hurt performance, since you'd get maximum I/O throughput from sequential reads on a single thread. If the drive is fragmented or you have a RAID array or similar, then having multiple I/O requests in flight simultaneously might boost your I/O throughput, since the controller may be able to intelligently rearrange them to make more efficient reads.

I think it might be helpful to view this as really two separate problems. One is how to get maximum I/O throughput for your file reads, the other is how to make maximum use of your CPU for processing the files. You would probably get optimal throughput by having a small number of I/O threads kicking off I/O requests and a pool of worker threads roughly equal to the number of logical CPU cores processing the data as it becomes available. Whether it is worth the effort to implement a more complex setup like that depends on where the bottlenecks are in your particular problem though.
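
A hand-rolled sketch of that split, assuming a single reader thread feeding a small bounded queue that a few worker threads drain; the Chunk type, queue capacity and worker count are made up for illustration:

    #include <boost/bind.hpp>
    #include <boost/thread.hpp>
    #include <deque>
    #include <string>

    struct Chunk { std::string file; std::string data; };   // placeholder payload

    class BoundedQueue {
        std::deque<Chunk>         q_;
        boost::mutex              m_;
        boost::condition_variable not_full_, not_empty_;
        std::size_t               cap_;
        bool                      done_;
    public:
        explicit BoundedQueue(std::size_t cap) : cap_(cap), done_(false) {}

        void push(const Chunk& c) {
            boost::unique_lock<boost::mutex> lk(m_);
            while (q_.size() >= cap_) not_full_.wait(lk);   // back-pressure on the reader
            q_.push_back(c);
            not_empty_.notify_one();
        }
        bool pop(Chunk& out) {                              // returns false when finished
            boost::unique_lock<boost::mutex> lk(m_);
            while (q_.empty() && !done_) not_empty_.wait(lk);
            if (q_.empty()) return false;
            out = q_.front(); q_.pop_front();
            not_full_.notify_one();
            return true;
        }
        void finish() {
            boost::lock_guard<boost::mutex> lk(m_);
            done_ = true;
            not_empty_.notify_all();
        }
    };

    void reader(BoundedQueue* q) { /* read each file sequentially, q->push(...) */ q->finish(); }
    void worker(BoundedQueue* q) { Chunk c; while (q->pop(c)) { /* crunch c */ } }

    int main()
    {
        BoundedQueue queue(16);                        // small buffer between the stages
        boost::thread io_thread(reader, &queue);

        boost::thread_group workers;
        for (unsigned i = 0; i < 4; ++i)               // roughly the number of cores
            workers.create_thread(boost::bind(worker, &queue));

        io_thread.join();
        workers.join_all();
        return 0;
    }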

mattnewport
I think you said it better than I did.
Mike Dunlavey
+3  A: 

This might sound a bit too old-school, but have you considered simply forking processes? It sounds like you have highly independent work units with a small aggregation of return data. A process model would also free up virtual address space (which might be tight if you're on a 32-bit machine), allowing each worker room to, say, mmap() the whole file being processed.
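
A bare-bones sketch of the fork-and-report-back part, assuming each child writes a small fixed-size Stats struct down a shared pipe; real code would need error handling and a cap on concurrent children:

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>
    #include <cstdio>

    struct Stats { long lines; long bytes; };          // placeholder per-file results

    Stats process_one_file(const char* path) { Stats s = { 0, 0 }; /* ... */ return s; }

    int main()
    {
        const char* files[] = { "a.dat", "b.dat" };
        int fd[2];
        pipe(fd);                                      // children write, parent reads

        for (int i = 0; i < 2; ++i) {
            pid_t pid = fork();
            if (pid == 0) {                            // child: process one file
                close(fd[0]);
                Stats s = process_one_file(files[i]);
                write(fd[1], &s, sizeof s);            // small struct, so one atomic write
                _exit(0);
            }
        }
        close(fd[1]);                                  // parent keeps only the read end

        Stats total = { 0, 0 }, s;
        while (read(fd[0], &s, sizeof s) == (ssize_t)sizeof s) {
            total.lines += s.lines;
            total.bytes += s.bytes;
        }
        while (wait(NULL) > 0) {}                      // reap the children
        std::printf("total: %ld lines, %ld bytes\n", total.lines, total.bytes);
        return 0;
    }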

Jeff Mc
Actually I have, and I haven't discarded it yet. mmap() sounds attractive. However, I would have to find a way for the processes to return the processed data to the "main" process, and I could have too many processes alive at the same time.
Tom
+2  A: 

There are a lot of variables that will affect performance (OS, filesystem, hard drive speed vs CPU speed, data access patterns, how much processing is done on the data after it is read, etc).

So your best bet is to simply try a test run for every possible thread count, on a representative data set (a big one if possible, so that filesystem caching won't skew the results too badly), and record how long it takes each time. Start with a single thread, then try it again with two threads, and so on until you feel you have enough data points. At the end you should have data that graphs into a nice curve that indicates where the "sweet spot" is. You should be able to do this in a loop so that the results are compiled automatically overnight.

Jeremy Friesner
+1  A: 

As a ballpark number, you should probably keep the thread count between 10 and 100 to minimize lock contention and context switching overhead.

Andrew
+4  A: 

If the workload is anywhere near as I/O bound as it sounds, then you're probably going to get maximum throughput with about as many threads as you have spindles. If you have more than one disk and all data is on the same RAID 0, you probably don't want any more than one thread. If more than one thread is trying to access non-sequential parts of the disk, the OS must stop reading one file, even though it may be right under the head, and move to another part of the disk to service another thread, so that it doesn't starve. With only one thread, the disk need never stop reading to move the head.

Obviously that depends on the access patterns being very linear (such as with video recoding) and on the data actually being unfragmented on disk, both of which matter a lot. If the workload is more CPU bound, then it won't matter quite as much and you can use more threads, since the disk will be twiddling its thumbs anyway.

As other posters suggest, profile first!

TokenMacGuy
Even with one thread per spindle, if all the files are actually on the same disk (they are in the same directory), then you won't get that much help. If the files are on RAID (or in logical volumes), let alone available over a SAN system, then you may as well go with at least one per spindle - there's enough stuff between your threads and the raw disks to make it pure guesswork how many threads to use.
Jonathan Leffler
It also depends how effective your I/O scheduler is. If you have 10 I/O requests to make, you may benefit from making them all at once, to allow an efficient elevator scheduler on the disk spindles. If you simply send one request at a time, the I/O subsystem won't help you at all, and you'll kill all possible parallelism opportunities.
Tom
+4  A: 

To elaborate, it really depends on:

  • the I/O boundedness of the problem
      • how big the files are
      • how contiguous the files are
      • in what order they must be processed
      • whether you can determine the disk placement
  • how much concurrency you can get in the "global structure insert"
      • whether you can "silo" the data structure with a consolidation wrapper
  • the actual CPU cost of the "global structure insert"

For example, if your files reside on a 3 terabyte flash memory array, then the solution is different than if they reside on a single disk (where, if the "global structure insert" takes less time than the read, the problem is I/O bound and you might just as well have a 2-stage pipe with 2 threads: the read stage feeding the insert stage).

But in both cases the architecture would probably be a vertical pipeline of 2 stages: n reading threads and m writing threads, with n and m determined by the "natural concurrency" of each stage.

Creating a thread per file will probably lead to disk thrashing. Just like you tailor the number of threads of a CPU bound process to the naturally achievable CPU concurrency (and going above that creates context switching overhead AKA thrashing) the same is true on the I/O side - in a sense you can think of the disk thrashing as "context switching on the disk".

pgast
+11  A: 
Kirill V. Lyadvinsky
Great link! Note that the article also continues with suggestions such as: try to minimize the sequentiality (e.g. by multiplexing the data in one file).
xtofl
Added link to all Sutter's articles. They are not related to OP's question directly, but could be helpful still.
Kirill V. Lyadvinsky
Problem is that this assumes no overhead on the other threads. This is almost never the case, and I'd say it's not really effective having more than 1 thread per core (as long as that thread is ALWAYS under load). (continued...)
Grant Peters
If you have a thread idling on the core for a while (such as a thread waiting for input), then it will probably be worthwhile adding another thread to that core. Also, the more threads a core runs, the more it has to move the cache around, causing cache misses which will REALLY slow it down. Though this is a nice "law" (I don't think it can really be called that by scientific standards), it just doesn't hold up in the real world.
Grant Peters
In this particular scenario you also have hard disk access to contend with, which will be your real limiting factor. I'd suggest you have one thread dedicated to reading files into memory, which then hands them off to be processed by other threads; otherwise, if you try to read files on separate threads, the hard disk will start jumping all over the place. This will massively slow down the read speed of the hard disk, resulting in LONGER processing times. (In hindsight I probably should have posted this as an answer, but oh well, no rep for me.)
Grant Peters
I find that a significant amount of experimenting can help tune the number of threads used in an application. Slow I/O can mean that your application will run faster with more threads than cores, particularly when waiting for a slow network. On the other hand, Grant is right in the respect that many threads accessing disk can be detrimental to system performance by causing an excess number of seeks, etc. At any rate, the theoretical best number of threads may provide a good guess, but it's not necessarily going to get you the best performance. I can say this from experience.
Tom
+1  A: 

I agree with everyone suggesting a thread pool: You schedule tasks with the pool, and the pool assigns threads to do the tasks.

If you're CPU-bound, simply keep adding threads as long as the CPU usage is below 100%. When you're I/O bound, disk thrashing might at some point prevent more threads from improving speed. That you'll have to find out yourself.

Have you seen Intel's Threading Building Blocks? Note that I cannot comment on whether this is what you need. I only made a small toy project on Windows, and that was a few years ago. (It was somewhat similar to yours, BTW: it recursively traverses a folder hierarchy and counts lines in the source code files it finds.)

sbi
Adding threads until the CPU hits 100% may not be the best idea - excessive threading contention could kill your cache performance. But good on mentioning TBB.
Tom
+1  A: 

More threads will not necessarily give you higher throughput. Threads have a non-trivial cost, both to create (in terms of CPU time and OS resources) and to run (in terms of memory and scheduling). And the more threads you have, the more potential for contention with other threads. Adding threads can sometimes even slow down execution. Each problem is subtly different, and you are best off writing a nice, flexible solution and experimenting with the parameters to see what works best.

Your example code, spawning a thread for each file, would almost immediately swamp the system for values of max_threads beyond around 10. As others have suggested, a thread pool with a work queue is what you probably want. The fact that each file is independent is nice, as that makes it almost embarrassingly parallel (save for the aggregation at the end of each unit of work).

Some factors that will affect your throughput:

  • Number of CPU cores
  • The number of disk channels (spindles, RAID devices, etc)
  • The processing algorithm, and whether the problem is CPU or I/O bound
  • Contention for the master statistics structure

Last year I wrote an application that does essentially the same as you describe. I ended up using Python and the pprocess library. It used a multi-process model with a pool of worker processes, communicating via pipes (rather than threads). A master process would read the work queue, chop up the input into chunks, and send chunk info to the workers. A worker would crunch the data, collecting stats, and when it was done send the results back to the master. The master would combine the results with the global totals and send another chunk to the worker. I found it scaled almost linearly up to 8 workers (on an 8-core box, which is pretty good), and beyond that it degraded.

Some things to consider:

  • Use a thread pool with a work queue, where the number of threads is likely around the number of cores in your system
  • Alternatively, use a multi-process setup, which communicates via pipes
  • Evaluate using mmap() (or equivalent) to memory map the input files, but only after you've profiled the baseline case
  • Read data in multiples of the block size (e.g. 4 kB), and chop it up into lines in memory (see the sketch after this list)
  • Build in detailed logging from the start, to aid debugging
  • Keep an eye on contention when updating the master statistics, although it will likely be swamped by the processing and read times of the data
  • Don't make assumptions - test, and measure
  • Set up a local dev environment that is as close as possible to the deployment system
  • Use a database (such as SQLite) for state data, processing results, etc
  • The database can keep track of which files have been processed, which lines had errors, warnings, etc
  • Only give your app read-only access to the original directory and files, and write your results elsewhere
  • Be careful not to try to process files that are open by another process (there's a few tricks here)
  • Careful you don't hit OS limits of the number of files per directory
  • Profile everything, but be sure to only change one thing at a time, and keep detailed records. Performance optimization is hard.
  • Set up scripts so you can consistently re-run tests. Having a DB helps here, as you can delete the records that flag a file as having been processed and re-run against the same data.
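
For the block-read bullet above, a rough sketch of reading a file in fixed-size chunks and splitting it into lines in memory; the 4 kB buffer mirrors the bullet and should be tuned to your filesystem's block size:

    #include <cstdio>
    #include <string>
    #include <vector>

    // Read 'path' in 4 kB chunks, carrying partial lines over between chunks.
    std::vector<std::string> read_lines(const char* path)
    {
        std::vector<std::string> lines;
        std::string carry;                       // tail of the previous chunk
        char buf[4096];

        std::FILE* f = std::fopen(path, "rb");
        if (!f) return lines;

        std::size_t n;
        while ((n = std::fread(buf, 1, sizeof buf, f)) > 0) {
            carry.append(buf, n);
            std::string::size_type pos;
            while ((pos = carry.find('\n')) != std::string::npos) {
                lines.push_back(carry.substr(0, pos));
                carry.erase(0, pos + 1);
            }
        }
        if (!carry.empty()) lines.push_back(carry);   // file may not end with a newline
        std::fclose(f);
        return lines;
    }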

When you have a significant number of files in the one directory as you describe, aside from potentially hitting filesystem limits, the time to stat the directory and figure out which files you've already processed and which you still need to process goes up significantly. Consider breaking up the files into subdirectories by date, for example.

One more word on performance profiling: be careful when extrapolating performance from small test data sets to super-huge data sets. You can't. I found out the hard way that you can reach a certain point where the regular assumptions about resources that we make every day in programming just don't hold any more. For example, I only found out the statement buffer in MySQL is 16MB when my app went way over it! And keeping 8 cores busy can take a lot of memory; you can easily chew up 2GB of RAM if you're not careful! At some point you have to test on real data on the production system, but give yourself a safe test sandbox to run in, so you don't munge production data or files.

Directly related to this discussion is a series of articles on Tim Bray's blog called the "Wide Finder" project. The problem was simply to parse logfiles and generate some simple statistics, but in the fastest manner possible on a multicore system. Many people contributed solutions, in a variety of languages. It is definitely worth reading.

gavinb
+4  A: 

You said the files are all in one directory. Does that mean they are all on one physical drive?

If that is so, and assuming they are not already cached, then your job will be to keep the single read head busy, and no amount of threading will help it. In fact, if it has to hop between tracks due to parallelism, you could slow it down.

On the other hand, if the computation part takes significant time, causing the read head to have to wait, then it might make sense to have >1 thread.

Often, using threads for performance is missing the point unless it lets you get parallel pieces of hardware working at the same time.

More often, the value of threads is in, for example, keeping track of multiple simultaneous conversations, like if you have multiple users, where each thread can wait for its own Johny or Suzy and not get confused.

Mike Dunlavey
+1 First answer basically saying "start with 1 and measure from there" ^^
Oskar Duveborn