views:

2581

answers:

9

I have two (2GB each) files on my harddisk and want to compare them with each other:

  • Copying the original files with Windows explorer takes approx. 2-4 minutes (that is reading and writing - on the same physical and logical disk).
  • Reading with java.io.FileInputStream twice and comparing the byte arrays on a byte per byte basis takes 20+ minutes.
  • java.io.BufferedInputStream buffer is 64kb, the files are read in chunks and then compared.
  • Comparison is done is a tight loop like

    int numRead = Math.min(numRead[0], numRead[1]);
    for (int k = 0; k < numRead; k++)
    {
       if (buffer[1][k] != buffer[0][k])
       {
          return buffer[0][k] - buffer[1][k];
       }
    }
    

What can I do to speed this up? Is NIO supposed to be faster then plain streams? Is Java unable to use DMA/SATA technologies and does some slow OS-API calls instead?

EDIT:
Thanks for the answers. I did some experiments based on them. As Andreas showed

streams or nio approaches do not differ much.
More important is the correct buffer size.

This is confirmed by my own experiments. As the files are read in big chunks, even additional buffers (BufferedInputStream) do not give anything. Optimising the comparison is possible and I got the best results with 32-fold unrolling, but the time spend in comparison is small compared to disk read, so the speedup is small. Looks like there is nothing I can do ;-(

+6  A: 

With such large files, you are going to get MUCH better performance with java.nio.

Additionally, reading single bytes with java streams can be very slow. Using a byte array (2-6K elements from my own experiences, ymmv as it seems platform/application specific) will dramatically improve your read performance with streams.

Stu Thompson
i was "afraid" of that. code is quite old and worked good since long, but files tend to grow all the time...
Peter Kofler
If you go with a MappedByteBuffer (which uses the OS's virtual memory paging subsystem) you may be able to minimize changes to the code and yet get substantial improvements in speed. I would venture a guess of "orders of magnitude" faster.
Stu Thompson
NIO can be confusing, you will want to start with ByteBuffer.allocateDirect() to get the highest performance (it uses memory-mapped files at that point). http://java.sun.com/javase/6/docs/api/java/nio/ByteBuffer.html
AgileJon
File I/O with NIO is not so confusing—it's arguably just as simple as `java.io`. If you want to get confused, jump into the asynchronous networking stuff.
Stu Thompson
+4  A: 

Reading and writing the files with Java can be just as fast. You can use FileChannels. As for comparing the files, obviously this will take a lot of time comparing byte to byte Here's an example using FileChannels and ByteBuffers (could be further optimized):

public static boolean compare(String firstPath, String secondPath, final int BUFFER_SIZE) throws IOException {
    FileChannel firstIn = null, secondIn = null;
    try {
        firstIn = new FileInputStream(firstPath).getChannel();
        secondIn = new FileInputStream(secondPath).getChannel();
        if (firstIn.size() != secondIn.size())
            return false;
        ByteBuffer firstBuffer = ByteBuffer.allocateDirect(BUFFER_SIZE);
        ByteBuffer secondBuffer = ByteBuffer.allocateDirect(BUFFER_SIZE);
        int firstRead, secondRead;
        while (firstIn.position() < firstIn.size()) {
            firstRead = firstIn.read(firstBuffer);
            secondRead = secondIn.read(secondBuffer);
            if (firstRead != secondRead)
                return false;
            if (!buffersEqual(firstBuffer, secondBuffer, firstRead))
                return false;
        }
        return true;
    } finally {
        if (firstIn != null) firstIn.close();
        if (secondIn != null) firstIn.close();
    }
}

private static boolean buffersEqual(ByteBuffer first, ByteBuffer second, final int length) {
    if (first.limit() != second.limit())
        return false;
    if (length > first.limit())
        return false;
    first.rewind(); second.rewind();
    for (int i=0; i<length; i++)
        if (first.get() != second.get())
            return false;
    return true;
}
laginimaineb
do you have any ideas about comparing faster than byte per byte?
Peter Kofler
Well... As I said, you could use FileChannels (and ByteBuffers). I can compare two 1.6GB files in 60 seconds. I've edited my original post to include the code I use.
laginimaineb
i like this example. you don't need to read the WHOLE files into arrays to compare them. otherwise, you wast lots of time reading files that might be the same at byte 1, instead of reading 2 bytes.
John Gardner
A: 

DMA/SATA are hardware/low-level techlonogies and aren't visible to any programming language whatsoever.

For memory mapped input/output you should use java.nio, I believe.

Are you sure that you aren't reading those files by one byte? That would be wasteful, I'd recommend doing it block-by-block, and each block should be something like 64 megabytes to minimize seeking.

alamar
you mean 64kb, yes? not megabytes?
Stu Thompson
Why, megabytes, if you can afford it (you can today).Reading two files by 64 kilobytes isn't a good idea IMO because the drive would seek endlessly.
alamar
Oh, thought it was a type. This smells like premature optimisation to me. I question such a large value because I think the read will block until the entire 64MB is read in, resulting slower overall performance. Only actual performance metrics will show conclusively, but I am highly skeptical of your theory.
Stu Thompson
Well, do the tests, draw the graph.
alamar
Reading 64mb of sequentially data aren't that slow.
Kimble
Well, but reading 65536 one-kilobyte blocks and writing another one can be costly.Operating system would probably not let you to do that anyway, it would cache changes and then sync.
alamar
+2  A: 

You can have a look at Suns Article for I/O Tuning (altough already a bit dated), maybe you can find similarities between the examples there and your code. Also have a look at the java.nio package which contains faster I/O elements than java.io. Dr. Dobbs Journal has a quite nice article on high performance IO using java.nio.

If so, there are further examples and tuning tips available there which should be able to help you to speed up your code.

Furthermore the Arrays class has methods for comparing byte arrays build in, maybe these can also be used to make things faster and clear up your loop a bit.

Kosi2801
good idea with Arrays class. Under the hood it is doing byte-wise comparison in a tight loop. It is not as fast as the 32fold unrolled loop I am using currently, but would shorten the code considerably, esp. for testing the IO performance.
Peter Kofler
+3  A: 

The following is a good article on the relative merits of the different ways to read a file in java. May be of some use:

How to read files quickly

Gareth Davis
A: 

For a better comparison try copying two files at once. A hard drive can read one file much more efficiently than reading two (as the head has to move back and forth to read) One way to reduce this is to use larger buffers, e.g. 16 MB. with ByteBuffer.

With ByteBuffer you can compare 8-bytes at a time by comparing long values with getLong()

If your Java is efficient, most of the work is in the disk/OS for reading and writing so it shouldn't be much slower than using any other language (as the disk/OS is the bottleneck)

Don't assume Java is slow until you have determined its not a bug in your code.

Peter Lawrey
I question comparing longs as they have to be constructed on the fly. Shouldn't it be faster to unroll the loop 8 times? (or 32 times that seemed to be optimal in my experiments).I agree that "select is not broken", so IO will be fast, but I know from the past (and maybe it's not true any more) that Java IO is/was much slower than say Pascal/C IO. But as most apps contain more than plain IO, Java is still faster in total nowadays.
Peter Kofler
For direct ByteBuffers, longs are not constructed in Java. It is just one call to JNI (they might be constructed in C code) I am pretty sure its faster than one call per byte. In a later answer I demonstrate this performs @ 74.8 MB/s reading two files from the same disk which may be fast enough.
Peter Lawrey
+4  A: 

I tried out three different methods of comparing two identical 3,8 gb files with buffer sizes between 8 kb and 1 MB. the first first method used just two buffered input streams

the second approach uses a threadpool that reads in two different threads and compares in a third one. this got slightly higher throughput at the expense of a high cpu utilisation. the managing of the threadpool takes a lot of overhead with those short-running tasks.

the third approach uses nio, as posted by laginimaineb

as you can see, the general approach does not differ much. more important is the correct buffer size.

what is strange that i read 1 byte less using threads. i could not spot the error tough.

comparing just with two streams
I was equal, even after 3684070360 bytes and reading for 704813 ms (4,98MB/sec * 2) with a buffer size of 8 kB
I was equal, even after 3684070360 bytes and reading for 578563 ms (6,07MB/sec * 2) with a buffer size of 16 kB
I was equal, even after 3684070360 bytes and reading for 515422 ms (6,82MB/sec * 2) with a buffer size of 32 kB
I was equal, even after 3684070360 bytes and reading for 534532 ms (6,57MB/sec * 2) with a buffer size of 64 kB
I was equal, even after 3684070360 bytes and reading for 422953 ms (8,31MB/sec * 2) with a buffer size of 128 kB
I was equal, even after 3684070360 bytes and reading for 793359 ms (4,43MB/sec * 2) with a buffer size of 256 kB
I was equal, even after 3684070360 bytes and reading for 746344 ms (4,71MB/sec * 2) with a buffer size of 512 kB
I was equal, even after 3684070360 bytes and reading for 669969 ms (5,24MB/sec * 2) with a buffer size of 1024 kB
comparing with threads
I was equal, even after 3684070359 bytes and reading for 602391 ms (5,83MB/sec * 2) with a buffer size of 8 kB
I was equal, even after 3684070359 bytes and reading for 523156 ms (6,72MB/sec * 2) with a buffer size of 16 kB
I was equal, even after 3684070359 bytes and reading for 527547 ms (6,66MB/sec * 2) with a buffer size of 32 kB
I was equal, even after 3684070359 bytes and reading for 276750 ms (12,69MB/sec * 2) with a buffer size of 64 kB
I was equal, even after 3684070359 bytes and reading for 493172 ms (7,12MB/sec * 2) with a buffer size of 128 kB
I was equal, even after 3684070359 bytes and reading for 696781 ms (5,04MB/sec * 2) with a buffer size of 256 kB
I was equal, even after 3684070359 bytes and reading for 727953 ms (4,83MB/sec * 2) with a buffer size of 512 kB
I was equal, even after 3684070359 bytes and reading for 741000 ms (4,74MB/sec * 2) with a buffer size of 1024 kB
comparing with nio
I was equal, even after 3684070360 bytes and reading for 661313 ms (5,31MB/sec * 2) with a buffer size of 8 kB
I was equal, even after 3684070360 bytes and reading for 656156 ms (5,35MB/sec * 2) with a buffer size of 16 kB
I was equal, even after 3684070360 bytes and reading for 491781 ms (7,14MB/sec * 2) with a buffer size of 32 kB
I was equal, even after 3684070360 bytes and reading for 317360 ms (11,07MB/sec * 2) with a buffer size of 64 kB
I was equal, even after 3684070360 bytes and reading for 643078 ms (5,46MB/sec * 2) with a buffer size of 128 kB
I was equal, even after 3684070360 bytes and reading for 865016 ms (4,06MB/sec * 2) with a buffer size of 256 kB
I was equal, even after 3684070360 bytes and reading for 716796 ms (4,90MB/sec * 2) with a buffer size of 512 kB
I was equal, even after 3684070360 bytes and reading for 652016 ms (5,39MB/sec * 2) with a buffer size of 1024 kB

the code used:

import junit.framework.Assert;
import org.junit.Before;
import org.junit.Test;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.util.Arrays;
import java.util.concurrent.*;

public class FileCompare {

    private static final int MIN_BUFFER_SIZE = 1024 * 8;
    private static final int MAX_BUFFER_SIZE = 1024 * 1024;
    private String fileName1;
    private String fileName2;
    private long start;
    private long totalbytes;

    @Before
    public void createInputStream() {
        fileName1 = "bigFile.1";
        fileName2 = "bigFile.2";
    }

    @Test
    public void compareTwoFiles() throws IOException {
        System.out.println("comparing just with two streams");
        int currentBufferSize = MIN_BUFFER_SIZE;
        while (currentBufferSize <= MAX_BUFFER_SIZE) {
            compareWithBufferSize(currentBufferSize);
            currentBufferSize *= 2;
        }
    }

    @Test
    public void compareTwoFilesFutures() 
            throws IOException, ExecutionException, InterruptedException {
        System.out.println("comparing with threads");
        int myBufferSize = MIN_BUFFER_SIZE;
        while (myBufferSize <= MAX_BUFFER_SIZE) {
            start = System.currentTimeMillis();
            totalbytes = 0;
            compareWithBufferSizeFutures(myBufferSize);
            myBufferSize *= 2;
        }
    }

    @Test
    public void compareTwoFilesNio() throws IOException {
        System.out.println("comparing with nio");
        int myBufferSize = MIN_BUFFER_SIZE;
        while (myBufferSize <= MAX_BUFFER_SIZE) {
            start = System.currentTimeMillis();
            totalbytes = 0;
            boolean wasEqual = isEqualsNio(myBufferSize);

            if (wasEqual) {
                printAfterEquals(myBufferSize);
            } else {
                Assert.fail("files were not equal");
            }

            myBufferSize *= 2;
        }

    }

    private void compareWithBufferSize(int myBufferSize) throws IOException {
        final BufferedInputStream inputStream1 =
                new BufferedInputStream(
                        new FileInputStream(new File(fileName1)),
                        myBufferSize);
        byte[] buff1 = new byte[myBufferSize];
        final BufferedInputStream inputStream2 =
                new BufferedInputStream(
                        new FileInputStream(new File(fileName2)),
                        myBufferSize);
        byte[] buff2 = new byte[myBufferSize];
        int read1;

        start = System.currentTimeMillis();
        totalbytes = 0;
        while ((read1 = inputStream1.read(buff1)) != -1) {
            totalbytes += read1;
            int read2 = inputStream2.read(buff2);
            if (read1 != read2) {
                break;
            }
            if (!Arrays.equals(buff1, buff2)) {
                break;
            }
        }
        if (read1 == -1) {
            printAfterEquals(myBufferSize);
        } else {
            Assert.fail("files were not equal");
        }
        inputStream1.close();
        inputStream2.close();
    }

    private void compareWithBufferSizeFutures(int myBufferSize)
            throws ExecutionException, InterruptedException, IOException {
        final BufferedInputStream inputStream1 =
                new BufferedInputStream(
                        new FileInputStream(
                                new File(fileName1)),
                        myBufferSize);
        final BufferedInputStream inputStream2 =
                new BufferedInputStream(
                        new FileInputStream(
                                new File(fileName2)),
                        myBufferSize);

        final boolean wasEqual = isEqualsParallel(myBufferSize, inputStream1, inputStream2);

        if (wasEqual) {
            printAfterEquals(myBufferSize);
        } else {
            Assert.fail("files were not equal");
        }
        inputStream1.close();
        inputStream2.close();
    }

    private boolean isEqualsParallel(int myBufferSize
            , final BufferedInputStream inputStream1
            , final BufferedInputStream inputStream2)
            throws InterruptedException, ExecutionException {
        final byte[] buff1Even = new byte[myBufferSize];
        final byte[] buff1Odd = new byte[myBufferSize];
        final byte[] buff2Even = new byte[myBufferSize];
        final byte[] buff2Odd = new byte[myBufferSize];
        final Callable<Integer> read1Even = new Callable<Integer>() {
            public Integer call() throws Exception {
                return inputStream1.read(buff1Even);
            }
        };
        final Callable<Integer> read2Even = new Callable<Integer>() {
            public Integer call() throws Exception {
                return inputStream2.read(buff2Even);
            }
        };
        final Callable<Integer> read1Odd = new Callable<Integer>() {
            public Integer call() throws Exception {
                return inputStream1.read(buff1Odd);
            }
        };
        final Callable<Integer> read2Odd = new Callable<Integer>() {
            public Integer call() throws Exception {
                return inputStream2.read(buff2Odd);
            }
        };
        final Callable<Boolean> oddEqualsArray = new Callable<Boolean>() {
            public Boolean call() throws Exception {
                return Arrays.equals(buff1Odd, buff2Odd);
            }
        };
        final Callable<Boolean> evenEqualsArray = new Callable<Boolean>() {
            public Boolean call() throws Exception {
                return Arrays.equals(buff1Even, buff2Even);
            }
        };

        ExecutorService executor = Executors.newCachedThreadPool();
        boolean isEven = true;
        Future<Integer> read1 = null;
        Future<Integer> read2 = null;
        Future<Boolean> isEqual = null;
        int lastSize = 0;
        while (true) {
            if (isEqual != null) {
                if (!isEqual.get()) {
                    return false;
                } else if (lastSize == -1) {
                    return true;
                }
            }
            if (read1 != null) {
                lastSize = read1.get();
                totalbytes += lastSize;
                final int size2 = read2.get();
                if (lastSize != size2) {
                    return false;
                }
            }
            isEven = !isEven;
            if (isEven) {
                if (read1 != null) {
                    isEqual = executor.submit(oddEqualsArray);
                }
                read1 = executor.submit(read1Even);
                read2 = executor.submit(read2Even);
            } else {
                if (read1 != null) {
                    isEqual = executor.submit(evenEqualsArray);
                }
                read1 = executor.submit(read1Odd);
                read2 = executor.submit(read2Odd);
            }
        }
    }

    private boolean isEqualsNio(int myBufferSize) throws IOException {
        FileChannel first = null, seconde = null;
        try {
            first = new FileInputStream(fileName1).getChannel();
            seconde = new FileInputStream(fileName2).getChannel();
            if (first.size() != seconde.size()) {
                return false;
            }
            ByteBuffer firstBuffer = ByteBuffer.allocateDirect(myBufferSize);
            ByteBuffer secondBuffer = ByteBuffer.allocateDirect(myBufferSize);
            int firstRead, secondRead;
            while (first.position() < first.size()) {
                firstRead = first.read(firstBuffer);
                totalbytes += firstRead;
                secondRead = seconde.read(secondBuffer);
                if (firstRead != secondRead) {
                    return false;
                }
                if (!nioBuffersEqual(firstBuffer, secondBuffer, firstRead)) {
                    return false;
                }
            }
            return true;
        } finally {
            if (first != null) {
                first.close();
            }
            if (seconde != null) {
                seconde.close();
            }
        }
    }

    private static boolean nioBuffersEqual(ByteBuffer first, ByteBuffer second, final int length) {
        if (first.limit() != second.limit() || length > first.limit()) {
            return false;
        }
        first.rewind();
        second.rewind();
        for (int i = 0; i < length; i++) {
            if (first.get() != second.get()) {
                return false;
            }
        }
        return true;
    }

    private void printAfterEquals(int myBufferSize) {
        NumberFormat nf = new DecimalFormat("#.00");
        final long dur = System.currentTimeMillis() - start;
        double seconds = dur / 1000d;
        double megabytes = totalbytes / 1024 / 1024;
        double rate = (megabytes) / seconds;
        System.out.println("I was equal, even after " + totalbytes
                + " bytes and reading for " + dur
                + " ms (" + nf.format(rate) + "MB/sec * 2)" +
                " with a buffer size of " + myBufferSize / 1024 + " kB");
    }
}
Andreas Petersson
+1. Good work, Andreas. Could I bother you to run it with a 64MB (yes, megabytes) buffer on the same data and machine? @alamar thinks this will somehow deliver magically excellent results because of a lack of seeking, of which I am skeptical having real world experience closer to your own results.
Stu Thompson
My experiments as well showed that buffer sizes 64kb/128kb are optimal, like in your tests. For reading a byte[] of 64kb in one step, it is not important if I use BufferedInputStream on top of FileInputStream or not, they perform the same. Although I had problems because after the file was read once, timings get smaller due to disk caching propably.
Peter Kofler
+2  A: 

After modifying your NIO compare function I get the following results.

I was equal, even after 4294967296 bytes and reading for 304594 ms (13.45MB/sec * 2) with a buffer size of 1024 kB
I was equal, even after 4294967296 bytes and reading for 225078 ms (18.20MB/sec * 2) with a buffer size of 4096 kB
I was equal, even after 4294967296 bytes and reading for 221351 ms (18.50MB/sec * 2) with a buffer size of 16384 kB

Note: this means the files are being read at a rate of 37 MB/s

Running the same thing on a faster drive

I was equal, even after 4294967296 bytes and reading for 178087 ms (23.00MB/sec * 2) with a buffer size of 1024 kB
I was equal, even after 4294967296 bytes and reading for 119084 ms (34.40MB/sec * 2) with a buffer size of 4096 kB
I was equal, even after 4294967296 bytes and reading for 109549 ms (37.39MB/sec * 2) with a buffer size of 16384 kB

Note: this means the files are being read at a rate of 74.8 MB/s

private static boolean nioBuffersEqual(ByteBuffer first, ByteBuffer second, final int length) {
    if (first.limit() != second.limit() || length > first.limit()) {
        return false;
    }
    first.rewind();
    second.rewind();
    int i;
    for (i = 0; i < length-7; i+=8) {
        if (first.getLong() != second.getLong()) {
            return false;
        }
    }
    for (; i < length; i++) {
        if (first.get() != second.get()) {
            return false;
        }
    }
    return true;
}
Peter Lawrey
A: 

Try setting the buffer on the input stream up to several megabytes.

Thorbjørn Ravn Andersen