views: 645
answers: 9

Hi guys.

I need advice from someone who knows Java and its memory issues very well. I have a large file (around 1.5 GB) and I need to split it into many smaller files (100, for example). I know generally how to do it (using a BufferedReader), but I would like to know if you have any advice regarding memory, or tips on how to do it faster. My file contains text, not binary data, and it has about 20 characters per line.

Thanks for any advice.

Best regards, C.C.

+1  A: 

You can use java.nio, which is faster than the classical input/output streams:

http://java.sun.com/javase/6/docs/technotes/guides/io/index.html
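
For the splitting use case, a rough sketch with FileChannel.transferTo() could look like this (paths, part size and naming are placeholders; note that it splits at byte offsets, so a text line may end up cut across two parts):

long chunkSize = 16L * 1024 * 1024; // example part size
FileChannel in = new FileInputStream("/bigfile.txt").getChannel();
try {
    long size = in.size();
    int part = 0;
    for (long pos = 0; pos < size; pos += chunkSize, part++) {
        FileChannel out = new FileOutputStream("/smallfile" + part + ".txt").getChannel();
        try {
            long remaining = Math.min(chunkSize, size - pos);
            long done = 0;
            while (done < remaining) { // transferTo may copy less than requested
                done += in.transferTo(pos + done, remaining - done, out);
            }
        } finally {
            out.close();
        }
    }
} finally {
    in.close();
}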

Kartoch
See my comment on Michael Borgwardt's post.
james
+3  A: 

To save memory, do not unnecessarily store or duplicate the data in memory (i.e. do not keep it in variables outside the loop). Just process the output immediately as the input comes in.

It really doesn't matter whether you're using BufferedReader or not. It will not cost significantly more memory, as some implicitly seem to suggest; at worst it will cost a few percent of performance. The same applies to using NIO: it improves scalability, not memory use. It only becomes interesting when you have hundreds of threads running on the same file.

Just loop through the file, write every line immediately to the other file as you read it, count the lines, and when the count reaches 100, switch to the next file, and so on.

Kickoff example:

import java.io.*;

String encoding = "UTF-8";
int maxlines = 100;
BufferedReader reader = null;
BufferedWriter writer = null;

try {
    reader = new BufferedReader(new InputStreamReader(new FileInputStream("/bigfile.txt"), encoding));
    int count = 0;
    for (String line; (line = reader.readLine()) != null;) {
        if (count++ % maxlines == 0) {
            close(writer); // close the previous part (no-op on the first pass)
            writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("/smallfile" + (count / maxlines) + ".txt"), encoding));
        }
        writer.write(line);
        writer.newLine();
    }
} finally {
    close(writer);
    close(reader);
}

where close() is a null-safe helper along these lines:

private static void close(Closeable resource) {
    if (resource != null) {
        try {
            resource.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
BalusC
Yes, just pipe it from the FileInputStream to the FileOutputStream using only a suitably sized byte buffer array.
Martin Wickman
Counting the lines does not work for me. The thing is: I have a file and I need to split it into, for example, 200 files (this can change; the number will come from the database). How do I do that? Just counting the lines does not work. What else can I do?
CC
Then count the number of bytes written instead of the number of lines. You can know the file size in bytes beforehand.
BalusC
Using `lineStr.getBytes().length`?
CC
For example. Don't forget to specify the proper encoding, e.g. `line.getBytes(encoding)`, otherwise it will mess up: the byte length depends on the character encoding used. If you don't actually care about text lines, then I would rather use `InputStream`/`OutputStream` instead and count the transferred bytes. By the way, it's unclear whether you mean that the files are stored in the DB or that the file split parameters are stored in the DB. If the files are actually also stored in the DB, then this may be memory-hogging as well. The exact solution will depend on the DB used.
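Roughly, replacing the line counter in the kickoff example above (the 50 MB threshold is just a placeholder):

long maxBytes = 50L * 1024 * 1024; // example threshold per part
long bytesWritten = 0;
int part = 0;
for (String line; (line = reader.readLine()) != null;) {
    long lineSize = line.getBytes(encoding).length + 1; // +1 is a rough guess for the line separator
    if (writer == null || bytesWritten + lineSize > maxBytes) {
        close(writer);
        writer = new BufferedWriter(new OutputStreamWriter(new FileOutputStream("/smallfile" + (part++) + ".txt"), encoding));
        bytesWritten = 0;
    }
    writer.write(line);
    writer.newLine();
    bytesWritten += lineSize;
}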
BalusC
A: 

Don't use read() without arguments. It's very slow. Better to read into a buffer and move it to the file quickly.

Use a BufferedInputStream, since it supports binary reading.

And that's all.
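
Something like this, for a plain byte-for-byte copy (an untested sketch; paths and buffer size are arbitrary):

InputStream in = new BufferedInputStream(new FileInputStream("/bigfile.txt"));
OutputStream out = new BufferedOutputStream(new FileOutputStream("/copy.txt"));
try {
    byte[] buffer = new byte[8192]; // read in chunks instead of byte by byte
    for (int n; (n = in.read(buffer)) != -1;) {
        out.write(buffer, 0, n); // write exactly the bytes that were read
    }
} finally {
    out.close();
    in.close();
}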

oneat
+2  A: 

This is a very good article: http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/

In summary, for great performance, you should:

  1. Avoid accessing the disk.
  2. Avoid accessing the underlying operating system.
  3. Avoid method calls.
  4. Avoid processing bytes and characters individually.

For example, to reduce the access to disk, you can use a large buffer. The article describes various approaches.
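
For example, both BufferedInputStream and BufferedReader let you pass an explicit buffer size (the 1 MB figure below is only an illustration, not a number from the article):

// fewer, larger reads mean fewer trips to the OS and the disk
InputStream in = new BufferedInputStream(new FileInputStream("/bigfile.txt"), 1024 * 1024);
Reader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"), 1024 * 1024);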

Bruno Rothgiesser
+4  A: 

First, if your file contains binary data, then using BufferedReader would be a big mistake (because you would be converting the data to String, which is unnecessary and could easily corrupt the data); you should use a BufferedInputStream instead. If it's text data and you need to split it along linebreaks, then using BufferedReader is OK (assuming the file contains lines of a sensible length).

Regarding memory, there shouldn't be any problem if you use a decently sized buffer (I'd use at least 1MB to make sure the HD is doing mostly sequential reading and writing).

If speed turns out to be a problem, you could have a look at the java.nio packages - those are supposedly faster than java.io.

Michael Borgwardt
Yes, I will use BufferedReader because I have a text file and I need to read it line by line. Now I have another problem: I cannot detect the size of the new file while writing it. The idea is that when the size of the new file exceeds xx MB, a new file should be generated.
CC
@CC: you could simply keep adding up the String length of the lines you are copying. But it depends on the character encoding how that translates to file size (and doesn't work well at all with variable-length encodings such as UTF-8)
Michael Borgwardt
I would suggest adding a custom FilterOutputStream between the FileOutputStream (on the bottom) and the OutputStreamWriter. Implement this filter to just keep track of the number of bytes going through it (Apache Commons IO may already have such a utility).
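A hand-rolled version might look like this (ByteCountingOutputStream is a made-up name; Commons IO's CountingOutputStream does essentially the same job):

import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Counts the bytes that pass through on their way to the underlying stream.
public class ByteCountingOutputStream extends FilterOutputStream {
    private long count;

    public ByteCountingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        count++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        count += len;
    }

    public long getCount() {
        return count;
    }
}

Wire it between the FileOutputStream and the OutputStreamWriter and check getCount() before starting each new line; as noted below, buffering above it means the count lags slightly.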
james
Also, a common mis-perception is that "nio" is _faster_ than "io". This may be the case in certain situations, but generally "nio" was written to be more _scalable_ than "io", where "scalable" is not necessarily the same as "faster".
james
@james: the filter won't yield the correct result when there's a BufferedWriter above it, though the difference may not be large enough to matter.
Michael Borgwardt
It will be behind, yes, but the alternative is trying to approximate bytes from chars which, as you pointed out, is ugly. I am assuming there is a fudge factor anyway. If the count needs to be _very_ accurate, you could flush after each line, but that will of course slow performance.
james
It is highly unlikely that using a 1MiB buffer will be any faster than somewhere between 8 and 16 KiB.
Software Monkey
@Software Monkey: Hm, wouldn't accessing the HD in 16 KiB chunks send it thrashing pretty badly? Perhaps the OS or hardware cache will alleviate that via prefetching. In the end, the optimal buffer size is probably best determined via benchmarks based on the actual use case.
Michael Borgwardt
@Michael: In my testing, bulk reading/writing ceased to gain any meaningful throughput increase for buffers larger than this. YMMV. At the time, circa early 2000's the sweet spot seemed to be about 10 K; it might be a little larger now, but probably not by much. It's likely to be about X * disk allocation unit, where X is quite small.
Software Monkey
+5  A: 

You can consider using memory-mapped files, via FileChannel.

Generally a lot faster for large files. There are performance trade-offs that could make it slower, so YMMV.

Related answer: http://stackoverflow.com/questions/1605332/java-nio-filechannel-versus-fileoutputstream-performance-usefulness
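
For illustration, copying one chunk of the source file through a mapped buffer could look roughly like this (chunk size and file names are made up):

FileChannel in = new FileInputStream("/bigfile.txt").getChannel();
FileChannel out = new FileOutputStream("/smallfile0.txt").getChannel();
try {
    long offset = 0;
    long length = Math.min(16L * 1024 * 1024, in.size()); // map (at most) the first 16 MB
    MappedByteBuffer chunk = in.map(FileChannel.MapMode.READ_ONLY, offset, length);
    while (chunk.hasRemaining()) { // the OS pages the mapped region in as it is written out
        out.write(chunk);
    }
} finally {
    out.close();
    in.close();
}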

Ryan Emerle
If you are just reading straight through a file, this will most likely not get you much of anything.
james
Still worth mentioning :)
Ryan Emerle
A: 

Does it have to be done in Java? I.e., does it need to be platform independent? If not, I'd suggest using the 'split' command in *nix. If you really wanted, you could execute this command from your Java program. While I haven't tested it, I imagine it would perform faster than whatever Java IO implementation you could come up with.
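
If you do call out to split, a rough sketch (assuming GNU split is on the PATH; -l sets the number of lines per piece):

// "split -l 100 /bigfile.txt part_" writes 100-line pieces named part_aa, part_ab, ...
ProcessBuilder pb = new ProcessBuilder("split", "-l", "100", "/bigfile.txt", "part_");
pb.redirectErrorStream(true);
Process process = pb.start();
int exit = process.waitFor();
if (exit != 0) {
    System.err.println("split exited with code " + exit);
}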

Mike
A: 

Unless you accidentally read in the whole input file instead of reading it line by line, your primary limitation will be disk speed. You may want to start with a file containing 100 lines, write it to 100 different files with one line in each, and make the triggering mechanism work on the number of lines written to the current file. That program will scale easily to your situation.

Thorbjørn Ravn Andersen
A: 

Yes. I also think that using read() with arguments, like read(char[], int offset, int length), is a better way to read such a large file (e.g. read(buffer, 0, buffer.length)).

And I also experienced missing values when using a BufferedReader instead of a BufferedInputStream on a binary data stream, so a BufferedInputStream is much better in a case like that.

Namalak