Hello

I'm going to work on comparing around 300 binary files using Scala, byte by byte, 4 MB each. However, judging from what I've already done, processing 15 files at the same time using java.io.BufferedInputStream took around 90 seconds on my machine, so I don't think my solution will scale well to a large number of files.

Ideas and suggestions are highly appreciated.

EDIT: The actual task is not just computing the differences but processing those files in the same sequential order. Let's say I have to look at byte i in every file at the same time, then move on to byte (i + 1).

+6  A: 

Did you notice your hard drive slowly evaporating as you read the files? Reading that many files in parallel is not something mechanical hard drives are designed to do at full speed.

If the files will always be this small (4MB is plenty small enough), I would read the entire first file into memory, and then compare each file with it in series.
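A minimal sketch of that approach in Java — the class and method names are made up, and error handling is kept deliberately thin:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class CompareAgainstFirst {
    // Reads the reference file fully into memory (4 MB is cheap), then
    // returns the subset of `others` whose contents differ from it.
    static List<Path> findDifferent(Path reference, List<Path> others) throws IOException {
        byte[] base = Files.readAllBytes(reference);
        return others.stream()
                .filter(p -> {
                    try {
                        return !Arrays.equals(base, Files.readAllBytes(p));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                })
                .collect(Collectors.toList());
    }
}
```

Reading each candidate file sequentially keeps the disk doing what it is good at: one long streaming read at a time.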

I can't comment on solid-state drives, as I have no first-hand experience with their performance.

zildjohn01
This is the correct answer: read them all into memory if at all possible.
Shane C. Mason
A: 

I would suggest using NIO if possible. Introduction To Java NIO and NIO2 seems like a decent guide to NIO if you are not familiar with it. I would not suggest reading a file and doing a comparison byte by byte, if that is what you are currently doing. You can use a ByteBuffer to read in chunks of data from a file and then do comparisons from that.
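A rough sketch of chunked comparison with FileChannel and ByteBuffer — the class name and the 64 KB chunk size are arbitrary choices, not anything from the thread:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class NioCompare {
    // Fills the buffer from the channel until it is full or EOF is hit.
    private static void fill(FileChannel ch, ByteBuffer buf) throws IOException {
        while (buf.hasRemaining() && ch.read(buf) != -1) { /* keep filling */ }
    }

    // Compares two files chunk by chunk; returns true iff contents are identical.
    static boolean sameContents(Path p1, Path p2) throws IOException {
        try (FileChannel c1 = FileChannel.open(p1, StandardOpenOption.READ);
             FileChannel c2 = FileChannel.open(p2, StandardOpenOption.READ)) {
            if (c1.size() != c2.size()) return false;
            ByteBuffer b1 = ByteBuffer.allocate(64 * 1024);
            ByteBuffer b2 = ByteBuffer.allocate(64 * 1024);
            while (true) {
                fill(c1, b1);
                fill(c2, b2);
                b1.flip();
                b2.flip();
                // ByteBuffer.equals compares the remaining bytes of both buffers
                if (!b1.equals(b2)) return false;
                if (b1.limit() == 0) return true; // both at EOF (sizes matched)
                b1.clear();
                b2.clear();
            }
        }
    }
}
```

Comparing whole buffers at a time avoids the per-byte method-call overhead of a stream-based loop.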

faran
+1  A: 

Are the files exactly the same number of bytes? If they are not, the files can be compared simply via the File.length() method to determine a first-order guess of equality.

Of course you may be wanting to do a much deeper comparison than just "are these files the same?"
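That first-order check is a one-liner with java.io.File — the class name here is made up:

```java
import java.io.File;

public class LengthCheck {
    // Cheap first-order filter: files of different lengths cannot be equal,
    // and File.length() needs no read of the file contents at all.
    static boolean mightBeEqual(File f1, File f2) {
        return f1.length() == f2.length();
    }
}
```

Only the files that pass this filter need the expensive content comparison.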

oxbow_lakes
+2  A: 

You are quite screwed, indeed.

Let's see... 300 * 4 MB = 1.2 GB. Does that fit your memory budget? If it does, by all means read them all into memory. But, to speed things up, you might try the following:

  1. Read 512 KB of every file, sequentially. You might try reading from 2 to 8 at the same time -- perhaps through Futures, and see how well it scales. Depending on your I/O system, you may gain some speed by reading a few files at the same time, but I do not expect it to scale much. EXPERIMENT! BENCHMARK!

  2. Process those 512 KB using Futures.

  3. Go back to step 1, unless you are finished with the files.

  4. Get the result back from the processing Futures.

On step number 1, by limiting the parallel reads you avoid thrashing your I/O subsystem. Push it as high as you can, maybe a bit less than that, but definitely not more than that.

By not reading each file in full on step number 1, you spend some of the time that would otherwise go to reading on useful CPU work. You may experiment with lowering the number of bytes read in step 1 as well.
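The steps above can be sketched in Java with an ExecutorService standing in for Scala Futures. Everything here is a placeholder: the class name, the pool size of 4, and the "processing" (which just sums bytes) are assumptions for illustration:

```java
import java.io.RandomAccessFile;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ChunkedPipeline {
    static final int CHUNK = 512 * 1024; // step 1: 512 KB per file per round

    // Reads the files chunk by chunk in lockstep (sequential reads, one file
    // at a time), hands each chunk to a small pool (step 2), loops until the
    // files are exhausted (step 3), then collects the results (step 4).
    static long process(List<Path> files) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<RandomAccessFile> handles = new ArrayList<>();
            for (Path p : files) handles.add(new RandomAccessFile(p.toFile(), "r"));
            List<Future<Long>> pending = new ArrayList<>();
            boolean done = false;
            while (!done) {
                for (RandomAccessFile raf : handles) {
                    byte[] buf = new byte[CHUNK];
                    int n = raf.read(buf); // one sequential read at a time
                    if (n <= 0) { done = true; break; }
                    final int len = n;
                    // placeholder "processing": sum the bytes of this chunk
                    pending.add(pool.submit(() -> {
                        long s = 0;
                        for (int i = 0; i < len; i++) s += buf[i] & 0xFF;
                        return s;
                    }));
                }
            }
            long total = 0;
            for (Future<Long> f : pending) total += f.get();
            for (RandomAccessFile raf : handles) raf.close();
            return total;
        } finally {
            pool.shutdown();
        }
    }
}
```

The key property is that disk reads stay sequential while the CPU-bound work overlaps them on the pool threads.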

Daniel
+1  A: 

If you are just looking to see whether they are the same, I would suggest using a hashing algorithm like SHA-1 to see if they match. Here is some Java source to make that happen.

Many large systems that handle data use SHA-1, including the NSA and Git. It's simply more efficient to use a hash instead of a byte-by-byte compare. The hashes can also be stored for later, to see if the data has been altered.
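In Java this is a few lines with java.security.MessageDigest — a minimal sketch, with a made-up class name and no streaming (fine for 4 MB files):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha1Hash {
    // Hashes a file's contents; equal digests almost certainly mean equal
    // files, and the digest can be stored to detect later modification.
    static String sha1Hex(Path file) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        byte[] digest = md.digest(Files.readAllBytes(file));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b));
        return sb.toString();
    }
}
```

With 300 files this turns O(n²) pairwise comparisons into 300 hashes plus cheap string comparisons.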

Here is a talk by Linus Torvalds, specifically about Git; it also mentions why he uses SHA-1.