More threads will not necessarily give you higher throughput. Threads have a non-trivial cost, both to create (in terms of CPU time and OS resources) and to run (in terms of memory and scheduling). And the more threads you have, the more potential for contention with other threads. Adding threads can sometimes even slow down execution. Each problem is subtly different, and you are best off writing a nice, flexible solution and experimenting with the parameters to see what works best.
Your example code, spawning a thread for each file, would almost immediately swamp the system for values of max_threads beyond around 10. As others have suggested, a thread pool with a work queue is what you probably want. The fact that each file is independent is nice, as that makes it almost embarrassingly parallel (save for the aggregation at the end of each unit of work).
Some factors that will affect your throughput:
- Number of CPU cores
- The number of disk channels (spindles, RAID devices, etc)
- The processing algorithm, and whether the problem is CPU or I/O bound
- Contention for the master statistics structure
Last year I wrote an application that does essentially the same as you describe. I ended up using Python and the pprocess library. It used a multi-process model with a pool of worker processes communicating via pipes (rather than threads). A master process would read the work queue, chop the input into chunks, and send chunk info to the workers. A worker would crunch the data, collecting stats, and when it was done send the results back to the master. The master would combine the results with the global totals and send another chunk to the worker. I found it scaled almost linearly up to 8 worker processes (on an 8-core box), which is pretty good; beyond that it degraded.
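To make that concrete, here is a minimal sketch of the same master/worker pattern using Python's standard multiprocessing module rather than pprocess; the input file name, chunk size and per-line work are placeholders, not the code I actually ran:

```
# Master/worker sketch: the master chops the input into chunks, workers
# crunch each chunk independently, and the master merges the partial stats.
import multiprocessing as mp
from collections import Counter

CHUNK_LINES = 10000  # lines per unit of work; tune this

def worker(chunk):
    """Crunch one chunk of lines and return partial statistics."""
    stats = Counter()
    for line in chunk:
        stats[line.split(' ', 1)[0]] += 1   # placeholder: count by first field
    return stats

def chunks(path, n=CHUNK_LINES):
    """Master side: read the input and yield chunks of n lines."""
    with open(path) as f:
        buf = []
        for line in f:
            buf.append(line)
            if len(buf) == n:
                yield buf
                buf = []
        if buf:
            yield buf

if __name__ == '__main__':
    totals = Counter()
    with mp.Pool(processes=mp.cpu_count()) as pool:
        # imap_unordered hands chunks to idle workers and yields partial
        # results as they finish; the master folds them into the totals.
        for partial in pool.imap_unordered(worker, chunks('input.log')):
            totals.update(partial)
    print(totals.most_common(10))
```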
Some things to consider:
- Use a thread pool with a work queue, where the number of threads is likely around the number of cores in your system (see the thread-pool sketch after this list)
- Alternatively, use a multi-process setup communicating via pipes
- Evaluate using mmap() (or equivalent) to memory-map the input files, but only after you've profiled the baseline case
- Read data in multiples of the block size (e.g. 4kB), and chop it up into lines in memory (see the block-read sketch after this list)
- Build in detailed logging from the start, to aid debugging
- Keep an eye on contention when updating the master statistics, although it will likely be swamped by the processing and read times of the data
- Don't make assumptions - test, and measure
- Set up a local dev environment that is as close as possible to the deployment system
- Use a database (such as SQLite) for state data, processing results, etc
- The database can keep track of which files have been processed, which lines had errors, warnings, etc (see the SQLite sketch after this list)
- Only give your app read-only access to the original directory and files, and write your results elsewhere
- Be careful not to try to process files that are open by another process (there are a few tricks here)
- Be careful you don't hit OS limits on the number of files per directory
- Profile everything, but be sure to only change one thing at a time, and keep detailed records. Performance optimization is hard.
- Set up scripts so you can consistently re-run tests. Having a DB helps here, as you can delete the records that flag a file as having been processed and re-run against the same data.
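To make the thread-pool point concrete, here is a minimal sketch using concurrent.futures, with per-file results merged into the master statistics under a lock; the file glob, per-line work and worker count are placeholders. Keep in mind that in CPython the GIL limits pure-Python parsing threads to roughly one core, which is part of why a multi-process setup can scale better for CPU-bound work:

```
# Thread pool with a work queue: each file is one unit of work, and each
# worker merges its local stats into the master stats once per file, so
# lock contention stays negligible.
import glob
import threading
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

master_stats = Counter()
stats_lock = threading.Lock()

def parse_file(path):
    local = Counter()
    with open(path) as f:
        for line in f:
            local[line.split(' ', 1)[0]] += 1   # placeholder per-line work
    with stats_lock:
        master_stats.update(local)

if __name__ == '__main__':
    files = glob.glob('incoming/*.log')              # placeholder layout
    with ThreadPoolExecutor(max_workers=8) as pool:  # roughly one per core
        list(pool.map(parse_file, files))            # consume so errors surface
    print(master_stats.most_common(10))
```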
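For the block-size reading suggestion, here is a minimal sketch that reads in multiples of the block size and chops the data into lines in memory; the 64kB block size is just a starting point to tune, and an mmap() variant is noted in the trailing comment:

```
def read_lines_blockwise(path, block_size=64 * 1024):
    """Yield complete lines, reading the file block_size bytes at a time."""
    with open(path, 'rb') as f:
        leftover = b''
        while True:
            block = f.read(block_size)
            if not block:
                break
            block = leftover + block
            lines = block.split(b'\n')
            leftover = lines.pop()   # the last piece may be a partial line
            for line in lines:
                yield line
        if leftover:
            yield leftover           # file did not end with a newline

# mmap() variant (roughly equivalent; lets the OS manage the paging):
#   import mmap
#   with open(path, 'rb') as f:
#       m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
#       for line in iter(m.readline, b''):
#           ...
#       m.close()
```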
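And a minimal sketch of the SQLite state tracking; the schema and column names are hypothetical, but it shows how deleting rows lets you re-run a test against the same data:

```
# Track which files have been processed, plus error/warning counts, so test
# runs can be repeated simply by deleting the rows.
import sqlite3

def open_state_db(path='state.db'):
    db = sqlite3.connect(path)
    db.execute("""
        CREATE TABLE IF NOT EXISTS processed (
            filename     TEXT PRIMARY KEY,
            processed_at TEXT DEFAULT CURRENT_TIMESTAMP,
            errors       INTEGER DEFAULT 0,
            warnings     INTEGER DEFAULT 0
        )""")
    return db

def is_processed(db, filename):
    row = db.execute("SELECT 1 FROM processed WHERE filename = ?",
                     (filename,)).fetchone()
    return row is not None

def mark_processed(db, filename, errors=0, warnings=0):
    db.execute("INSERT OR REPLACE INTO processed (filename, errors, warnings) "
               "VALUES (?, ?, ?)", (filename, errors, warnings))
    db.commit()

# To re-run against the same data, clear the flags:  DELETE FROM processed;
```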
When you have a significant number of files in a single directory, as you describe, then aside from potentially hitting filesystem limits, the time to stat the directory and work out which files you've already processed and which you still need to process goes up significantly. Consider breaking the files up into subdirectories, by date for example.
One more word on performance profiling: be careful when extrapolating performance from small test data sets to super-huge data sets. You can't. I found out the hard way that you can reach a point where the regular assumptions about resources that we make every day in programming just don't hold any more. For example, I only found out that the statement buffer in MySQL is 16MB when my app went way over it! And keeping 8 cores busy can take a lot of memory; you can easily chew up 2GB of RAM if you're not careful. At some point you have to test on real data on the production system, but give yourself a safe test sandbox to run in, so you don't munge production data or files.
Directly related to this discussion is a series of articles on Tim Bray's blog called the "Wide Finder" project. The problem was simply to parse logfiles and generate some simple statistics, but in the fastest manner possible on a multicore system. Many people contributed solutions, in a variety of languages. It is definitely worth reading.