views: 141

answers: 5

I have a 2GB file containing student records. I need to find students based on certain attributes in each record and create a new file with the results. The order of the filtered students should be the same as in the original file. What's the most efficient and fastest way of doing this using the Java I/O API and threads without running into memory issues? The max heap size for the JVM is set to 512MB.

+2  A: 
  1. 2GB is huge for a file; you SHOULD go for a DB.
  2. If you really want to use the Java I/O API, then try these: Handling large data files efficiently with Java and Tuning Java I/O Performance
zengr
+5  A: 

What kind of file? Text-based, like CSV?

The easiest way would be to do something like grep does: read the file line by line, parse the line, check your filter criterion, and if it matches, output a result line; then go on to the next line until the file is done. This is very memory-efficient, as you only have the current line (or a slightly larger buffer) loaded at any one time, and your process needs to read through the whole file just once.
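A minimal sketch of that approach (the file names, the comma-separated layout and the matches() predicate are placeholder assumptions, not part of the question):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class StudentFilter {

    // Hypothetical criterion: keep records whose third column equals "f".
    // Adjust the parsing and the test to the real record layout.
    static boolean matches(String line) {
        String[] fields = line.split(",");
        return fields.length > 2 && "f".equals(fields[2].trim());
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new FileReader("students.txt"));
        BufferedWriter out = new BufferedWriter(new FileWriter("filtered.txt"));
        try {
            String line;
            while ((line = in.readLine()) != null) {   // one line in memory at a time
                if (matches(line)) {
                    out.write(line);                    // written in input order
                    out.newLine();
                }
            }
        } finally {
            out.close();
            in.close();
        }
    }
}

Because matching lines are written in the order they are read, the output preserves the original record order.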

I do not think multiple threads are going to help much. They would make things considerably more complicated, and since the process seems to be I/O-bound anyway, trying to read the same file with multiple threads probably will not improve throughput.

If you find that you need to do this often, and going through the file each time is too slow, you need to build some kind of index. The easiest way to do that would be to import the file into a DB (can be an embedded DB like SQLite or HSQL) first.
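As a rough sketch of what such an import could look like with embedded HSQLDB (the table layout, column names and comma-separated file format are assumptions, and the HSQLDB jar is expected on the classpath):

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class ImportToHsqldb {
    public static void main(String[] args) throws Exception {
        Connection con = DriverManager.getConnection("jdbc:hsqldb:file:studentsdb", "SA", "");
        BufferedReader in = new BufferedReader(new FileReader("students.txt"));
        try {
            Statement st = con.createStatement();
            st.execute("CREATE TABLE student (name VARCHAR(100), age INT, sex CHAR(1))");
            st.execute("CREATE INDEX idx_sex ON student (sex)");
            st.close();

            PreparedStatement ps =
                con.prepareStatement("INSERT INTO student (name, age, sex) VALUES (?, ?, ?)");
            String line;
            int batch = 0;
            while ((line = in.readLine()) != null) {
                String[] f = line.split(",");           // assumed layout: name,age,sex
                ps.setString(1, f[0].trim());
                ps.setInt(2, Integer.parseInt(f[1].trim()));
                ps.setString(3, f[2].trim());
                ps.addBatch();
                if (++batch % 10000 == 0) {
                    ps.executeBatch();                  // flush periodically to keep memory bounded
                }
            }
            ps.executeBatch();
            ps.close();
        } finally {
            in.close();
            con.close();
        }
    }
}

After the import, repeated queries such as SELECT * FROM student WHERE sex = 'f' can use the index instead of re-reading the whole file.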

Thilo
Oh boy, I was typing exactly the same answer. It indeed all boils down to doing the work line by line rather than storing the whole thing in Java's memory.
BalusC
+5  A: 

I wouldn't overcomplicate this until you find that the boringly simple way doesn't work for what you need. Essentially you just need to:

  • open an input stream to the 2GB file, remembering to buffer it (e.g. by wrapping it in a BufferedInputStream)
  • open an output stream to the filtered file you're going to create
  • read the first record from the input stream, look at whatever attribute you need to decide if you "need" it; if you do, write it to the output file
  • repeat for the remaining records

On one of my test systems with extremely modest hardware, a BufferedInputStream around a FileInputStream read about 500 MB in 25 seconds out of the box, so it would probably take under 2 minutes to process your 2GB file, and the default buffer size is basically as good as it gets (see the BufferedInputStream timings I made for more details). I imagine that with state-of-the-art hardware the time could well be halved.
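For reference, a minimal sketch of how such a read-through might be timed (the file name is a placeholder; this is not the exact benchmark referenced above):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadThroughputTimer {
    public static void main(String[] args) throws IOException {
        long start = System.nanoTime();
        long total = 0;
        byte[] buf = new byte[8192];
        InputStream in = new BufferedInputStream(new FileInputStream("students.txt"));
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                total += n;                             // just read and count bytes
            }
        } finally {
            in.close();
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Read %d bytes in %.1f s (%.1f MB/s)%n",
                total, seconds, total / 1e6 / seconds);
    }
}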

Whether you need to go to a lot of effort to shave off those 2-3 minutes, or just go for a wee while you're waiting for it to run, is a decision you'll have to make depending on your requirements. I think the database option won't buy you much unless you need to do a lot of different processing runs on the same set of data (and there are other solutions to that which don't automatically mean a database).

Neil Coffey
+1, esp. for "go for a wee while you're waiting"
Software Monkey
A: 

If each record has a fixed byte length, you could use a RandomAccessFile and seek directly to the position of the attribute you want to filter on. Consider this example:

Name Age Sex somethn
Paul..17.m..........
Biggi.18.f..........
John..19.m..........
Lisa..21.f..........
01234567890123456789

The record size is 20, so if you filter for sex=f you seek to (n*20 + 9) for row n (counting from zero).
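A sketch of that seek-based scan, assuming a single-byte encoding (see the caveat in the comments below), 20 data bytes per record plus one byte for the line break, and the sex field at offset 9 (file names are placeholders):

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.RandomAccessFile;

public class FixedRecordFilter {

    static final int RECORD_SIZE = 21;   // 20 data bytes + '\n'
    static final int SEX_OFFSET = 9;

    public static void main(String[] args) throws IOException {
        RandomAccessFile raf = new RandomAccessFile("students.dat", "r");
        BufferedWriter out = new BufferedWriter(new FileWriter("filtered.txt"));
        try {
            long records = raf.length() / RECORD_SIZE;
            byte[] record = new byte[RECORD_SIZE];
            for (long n = 0; n < records; n++) {
                raf.seek(n * RECORD_SIZE + SEX_OFFSET);   // jump straight to the sex field
                if (raf.read() == 'f') {
                    raf.seek(n * RECORD_SIZE);            // go back and read the whole record
                    raf.readFully(record);
                    out.write(new String(record, "US-ASCII").trim());
                    out.newLine();
                }
            }
        } finally {
            out.close();
            raf.close();
        }
    }
}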

Do you know anything about the organisation/sort order of the file? Maybe you can use that. If you need to parse the file often, it may pay to sort it by the most frequently searched fields.

The next step might be to create an index of the rows. By the way, do they stay stable, or is the file changed? How often?

The more often the file is searched and updated, the more I would tend towards a database.

user unknown
You seem to assume that characters == bytes.
mklhmnn
You also seem to assume that this will be faster than a buffered reader or stream, which it certainly won't be.
EJP
A: 

I think you should use memory-mapped files. They let you map (portions of) a large file into memory, acting much like virtual memory, and as far as performance is concerned, mapped files are faster than stream read/write.
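A sketch of what that might look like; a single MappedByteBuffer is limited to roughly 2GB, so a file of this size is better mapped in windows (the window size and file name are assumptions):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedScan {

    static final long WINDOW = 64L * 1024 * 1024;   // map 64 MB at a time

    public static void main(String[] args) throws IOException {
        long lines = 0;
        RandomAccessFile raf = new RandomAccessFile("students.txt", "r");
        FileChannel ch = raf.getChannel();
        try {
            long size = ch.size();
            for (long pos = 0; pos < size; pos += WINDOW) {
                long len = Math.min(WINDOW, size - pos);
                MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, pos, len);
                while (buf.hasRemaining()) {
                    // Stand-in for real record handling: count line breaks.
                    // A real filter would assemble records here and also deal
                    // with records that straddle a window boundary.
                    if (buf.get() == '\n') {
                        lines++;
                    }
                }
            }
        } finally {
            ch.close();
            raf.close();
        }
        System.out.println("Records seen: " + lines);
    }
}

Whether this beats a plain buffered stream for a single sequential pass depends on the platform, so it is worth measuring before committing to it.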

Emil