views: 237

answers: 3

I am trying to improve an external sort implementation in Java.

I have a bunch of BufferedReader objects open for temporary files. I repeatedly remove the top line from each of these files. This pushes the limits of the Java heap. I would like a more scalable way of doing this without losing speed to a bunch of constructor calls.

One solution is to open files only when they are needed, read the first line, and then delete it. But I am afraid that this would be significantly slower.

So, using the standard Java libraries, what is the most efficient way of doing this?

--Edit--

For an external sort, the usual method is to break a large file up into several chunk files, sort each of the chunks, and then treat the sorted files like buffers: pop the top item from each file, and the smallest of all of those is the global minimum. Continue until all items have been consumed. http://en.wikipedia.org/wiki/External_sorting

My temporary files (buffers) are basically BufferedReader objects. The operations performed on these files are the same as stack/queue operations (peek and pop, no push needed).

I am trying to make these peek and pop operations more memory-efficient, because keeping many BufferedReader objects open takes up too much space. A sketch of the merge loop follows.
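For context, here is a minimal sketch of that merge phase (simplified and illustrative, not my actual code), using a PriorityQueue keyed on each chunk's current head line. It keeps one BufferedReader open per chunk, which is exactly the memory cost in question:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Writer;
import java.util.List;
import java.util.PriorityQueue;

public class MergeSketch {
  // Sketch of the k-way merge phase of an external sort: each
  // BufferedReader wraps one already-sorted chunk file.
  public static void merge(List<BufferedReader> chunks, Writer out)
      throws IOException {
    // Pairs a chunk reader with its current head ("peeked") line.
    class Head {
      final BufferedReader reader;
      String line;
      Head(BufferedReader reader, String line) {
        this.reader = reader;
        this.line = line;
      }
    }
    PriorityQueue<Head> heap =
        new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
    for (BufferedReader r : chunks) {
      String first = r.readLine();
      if (first != null) {
        heap.add(new Head(r, first));
      }
    }
    while (!heap.isEmpty()) {
      Head min = heap.poll();               // global minimum across all chunks
      out.write(min.line);
      out.write('\n');
      String next = min.reader.readLine();  // "pop", then re-peek
      if (next != null) {
        min.line = next;
        heap.add(min);
      }
    }
  }
}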

+1  A: 

If heap space is the main concern, use the [2nd form of the BufferedReader constructor][1] and specify a small buffer size.

[1]: http://java.sun.com/j2se/1.5.0/docs/api/java/io/BufferedReader.html#BufferedReader(java.io.Reader, int)
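For example (the 512 here is just an illustrative size, and `tempFile` a placeholder):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

class SmallBuffer {
  static BufferedReader open(File tempFile) throws IOException {
    // Second constructor form: an explicit 512-char buffer
    // instead of the 8192-char default.
    return new BufferedReader(new FileReader(tempFile), 512);
  }
}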

mobrule
This will help!
christangrant
+1  A: 

> I have a bunch of BufferedReader objects open for temporary files. I repeatedly remove the top line from each of these files. This pushes the limits of the Java heap.

This is a really surprising claim. Unless you have thousands of files open at the same time, there is no way that should stress the heap. The default buffer size for a BufferedReader is 8192 characters (about 16 KB of char data), and there should be little extra space required. Even 1000 open readers amounts to only ~16 MB, and that is tiny compared with a typical Java application's memory usage.

Consider the possibility that something else is causing the heap problems. For example, if your program retained references to each line that it read, THAT would lead to heap problems.
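For instance, a hypothetical version of that mistake, where every line read stays reachable:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class LeakExample {
  // Every line read is kept reachable, so the heap grows with the
  // file size no matter how small the I/O buffers are.
  static List<String> readAll(String path) throws IOException {
    List<String> all = new ArrayList<String>();
    BufferedReader r = new BufferedReader(new FileReader(path));
    try {
      String line;
      while ((line = r.readLine()) != null) {
        all.add(line);  // retained reference: this is the leak
      }
    } finally {
      r.close();
    }
    return all;
  }
}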

(Or maybe your notion of what is "too much space" is unrealistic.)

> One solution is to open files only when they are needed, read the first line, and then delete it. But I am afraid that this would be significantly slower.

There is no doubt that it would be significantly slower! There is simply no efficient way to delete the first line from a file. Not in Java, or in any other language. Deleting characters from the beginning or middle of a file entails copying the file to a new one while skipping over the characters that need to be removed. There is no faster alternative.
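To make that concrete, here is a sketch of what "deleting the first line" has to do under the hood (a hypothetical helper, not something I am recommending):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

class DeleteFirstLine {
  // Removing the first line forces a copy of everything after it,
  // so each "pop" costs O(file size).
  static void deleteFirstLine(File src) throws IOException {
    File tmp = new File(src.getPath() + ".tmp");
    BufferedReader in = new BufferedReader(new FileReader(src));
    BufferedWriter out = new BufferedWriter(new FileWriter(tmp));
    try {
      in.readLine();  // skip the line being "deleted"
      String line;
      while ((line = in.readLine()) != null) {
        out.write(line);
        out.newLine();
      }
    } finally {
      in.close();
      out.close();
    }
    if (!src.delete() || !tmp.renameTo(src)) {
      throw new IOException("could not replace " + src);
    }
  }
}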

Stephen C
+1  A: 

I'm away from my compiler at the moment, but I think this will work. Edit: works fine.

I urge you to profile it and see. I bet the constructor calls are going to be nothing compared to the file I/O and your comparison operations.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;

public class FileStack {
  private final File file;
  private long position = 0;    // offset, in chars, of the current head line
  private String cache = null;  // the head line, once read

  public FileStack(File file) {
    this.file = file;
  }

  public String peek() throws IOException {
    if (cache != null) {
      return cache;
    }

    // Open the file only for the duration of a single read, so no
    // buffer is held between operations.
    BufferedReader r = new BufferedReader(new FileReader(file));
    try {
      // skip() may skip fewer chars than requested, so loop until
      // the full offset has been consumed (or EOF is hit).
      long remaining = position;
      while (remaining > 0) {
        long skipped = r.skip(remaining);
        if (skipped <= 0) {
          break;
        }
        remaining -= skipped;
      }
      cache = r.readLine();
      return cache;
    } finally {
      r.close();
    }
  }

  public String pop() throws IOException {
    String r = peek();
    if (r != null) {
      // if you have \r\n line endings, you may need +2 instead of +1
      // if lines could end either way, you'll need something more complicated
      position += r.length() + 1;
      cache = null;
    }
    return r;
  }
}
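A hypothetical way to plug this into the merge loop (one FileStack per sorted chunk; a linear scan for the minimum, just to keep the example short):

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class FileStackDemo {
  public static void main(String[] args) throws IOException {
    List<FileStack> stacks = new ArrayList<FileStack>();
    for (String path : args) {
      stacks.add(new FileStack(new File(path)));
    }
    while (true) {
      // Find the chunk whose head line is the global minimum.
      FileStack min = null;
      for (FileStack s : stacks) {
        if (s.peek() != null
            && (min == null || s.peek().compareTo(min.peek()) < 0)) {
          min = s;
        }
      }
      if (min == null) {
        break;  // every chunk is exhausted
      }
      System.out.println(min.pop());
    }
  }
}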
dave
I tried it out, open and close definitely dominate!
christangrant