views:

100

answers:

8

I have to search for a string in 10 large size files (in zip format 70 MB) and have to print the lines with the search string to corresponding 10 output files.(i.e. file 1 output should be in output_file1...file2---> output_file2). The same program takes 15 mins for a single file. But if use 10 threads to read 10 files and to write in 10 different files it should complete in 15 mins but its taking 40 mins.

How can I solve this. Or multithreading will take this much time only?

+4  A: 

I suppose you don't have a 10-core-cpu-machine in use - so your threads are not really running parallel. therefore it takes a bit longer than it mathematically should. next thing is you have to be aware that the thread-management also takes a bit of time (which is irrelevant). maybe you can speed up your search-mechanism for the files to gain some speed. for this you would need to post your source-code. but some advises:

  • you should try to keep the file-access count as low as possible since it is the slowest operation
  • try to use as less memory as possible, because if your machine starts to swap memory pages the speed is also reduced dramatically
  • since your doing this in java - you should use reg-ex to find the string inside the strings, because (as far as I remember) it is the fastest way to search through strings in java

but be aware that this measures may result in a very complex code to read for another human being or yourself in ... let's say six month+ because you won't remember everything you did and why you did it (comments ;))

Gambrinus
If you are searching for a literal, indexOf(startIdx, pattern) is way faster than regex.
ddimitrov
A: 

You probably have hard disk contention, which multi-threading won't help. In your case, you probably want just enough threads to keep the disk drive at 100% usage.

I'm assuming that the hard disk is your bottleneck, not the CPU. Multi-threading only gets things done "faster" if each thread doesn't have to fight over the same hardware. So, with multiple cores (CPUs) and multiple hard drives, you'll see better performance with multi-threading.

I'm surprised that it takes 15 minutes for a single file.

Here's how I would design this. 70 MB is not large. You could probably load each 70 MB uncompressed file into memory, one per thread. Then, unzip the data in real time as you search the compressed stream, keeping a bare amount of uncompressed data in memory. (Once you've searched it, throw it away). This will avoid hard disk thrashing and allow your CPU to reach 100% usage.

If memory is an issue, then load a few MB at a time from disk.

Marcus Adams
15 mins for 70 MB over 10 files isn't unlikely to disk issue (unless it is *really* fragmented).
Tom Hawtin - tackline
+5  A: 

Accessing files concurrently typically goes slower after 2-3 threads since the hard disk ends up thrashing about trying to read from all files simultaneously, similar to reading a defragmented file.

To avoid this, split the work into file readers and file parsers. The file readers bring in the data from the file (also decompressing), and the file parsers parse the data. You could use PipedInputStream/PipedOutputStream to forward data from your file readers to file parsers.

Because your files are zipped, reading involves both I/O and cpu, which can be interleaved nicely across 2-4 threads reading all files. For parsing the files, it's easiest to have just one thread reading from the PipedInputStream, so you will have one parser thread per file. Using multiple threads per file requires splitting up the stream and handling seaching at block boundaries, which complicates the process, and is not necessary here, since you probably have sufficient parallelism with 10 parser threads and 2-4 reader threads.

mdma
A: 

More threads will most likely just make it run slower since your bottleneck is going to be Disk IO. If you could load all your data into memory first, then you would see some speed up from multiple threads, but only to the point of #cores + 1, anything more would just be context switching overhead.

fuzzy lollipop
A: 

When you run this, is your CPU at 100% already? If not, one of two things;

  • if it's the hard drive, you could try moving to a faster hard drive, a RAID0 stripe (dangerous for data loss), or RAID5.
  • you have a multi-core CPU, and for some reason, it's not being run on all of the cores. You can check this somewhat in Windows by hitting CTRL-ALT-DEL, Task Manager, Performance tab. If the CPU usage history is maxed out on one graph, you have underutilized processors, and could consider threading. If the CPU usage isn't maxed anywhere, you have a hard drive bottleneck, and no amount of threading will do much for performance. If the CPU usage is maxed everywhere, then threading will only make this slower; you need a faster CPU to run that task faster, or a better algorithm.
Dean J
A: 

I am going to guess this is a GC issue. I guess you are reading the files a line at a time into a String. Perhaps you are even recompiling a regex for every line. Anyway, lots of memory allocation, but short lived objects. Multiple threads might tip enough of this in to getting copied into the "survivor" spaces (in typical Sun GC implementation). I guess use visualvm, or an obscure command line argument, to monitor how hard the GC is working.

Could also be a lock contention issue, but this looks embarrassingly parallel.

Tom Hawtin - tackline
A: 

You might want to look at the "Wide Finder" project Tim Bray created. It sounds very much like what you are doing, and I think examines most if not all of the issues you'll run into. hth

mezmo
A: 
Geeta