views: 407
answers: 4

I am crunching through many gigabytes of text data and I was wondering if there is a way to improve performance. For example, just iterating through 10 gigabytes of data line by line, without processing it at all, takes about 3 minutes.

Basically I have a dataIterator wrapper that contains a BufferedReader. I continuously call this iterator, which returns the next line.
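For reference, this is a minimal sketch of what that wrapper looks like (the names and buffer size here are illustrative, not my actual code):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.Iterator;

    // Illustrative sketch only: a BufferedReader read line by line through an Iterator.
    class DataIterator implements Iterator<String> {
        private final BufferedReader reader;
        private String next;

        DataIterator(String path) throws IOException {
            // BufferedReader's default buffer is 8 KB; a larger buffer is one easy knob to try.
            reader = new BufferedReader(new FileReader(path), 1 << 16);
            next = reader.readLine();
        }

        public boolean hasNext() {
            return next != null;
        }

        public String next() {
            String line = next;
            try {
                next = reader.readLine();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
            return line;
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }
    }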

Is the problem the number of strings being created? Or perhaps the number of function calls? I don't really know how to profile this application because it gets compiled as a JAR and used as a STAF service.

Any and all ideas appreciated.

+3  A: 

I think Java's NIO package would be immensely useful for your needs.

This Wikipedia article has some great background info on the specific improvements over "old" Java I/O.
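As a rough idea of the shape of it (buffer size is arbitrary and the line splitting is left as a comment), reading through a FileChannel into a reusable direct buffer looks something like this:

    import java.io.FileInputStream;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    public class NioReadSketch {
        public static void main(String[] args) throws Exception {
            // Sketch only: pull the file through a channel into one reusable direct buffer.
            FileChannel channel = new FileInputStream(args[0]).getChannel();
            ByteBuffer buffer = ByteBuffer.allocateDirect(1 << 20); // 1 MB buffer
            while (channel.read(buffer) != -1) {
                buffer.flip();
                // ... scan the buffer for '\n' and process each line, ideally as bytes ...
                buffer.clear();
            }
            channel.close();
        }
    }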

yalestar
I'll try this out.
esiegel
Not sure using NIO will help the read performance per se. If you read by mapping the file, it may help *indirectly*-- it should stop the reads from trampling the OS file cache.
Neil Coffey
A: 

If the program is launched via a regular "java -options... ClassName args..." command line, you can profile it. I'm most familiar with the NetBeans Profiler. It has a way to separately start the Java app (adding a Java option to the startup) and then attach the profiler.

If you're trying to optimize without measuring what needs improvement, you're working in the dark. You might get lucky or you might spend lots of time doing irrelevant work.
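For example, one standard way to let an external monitoring tool connect to the running JVM is to add the JMX remote properties to the launch command (the port is arbitrary and the jar/class names below are placeholders; an actual profiler agent would use its own -agentpath/-agentlib option instead):

    java -Dcom.sun.management.jmxremote \
         -Dcom.sun.management.jmxremote.port=9010 \
         -Dcom.sun.management.jmxremote.authenticate=false \
         -Dcom.sun.management.jmxremote.ssl=false \
         -cp yourService.jar your.ClassName args...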

John M
I compile the STAF service into a JAR, and then STAF gets called and incorporates the JAR file. I tried looking at it with JConsole, but I wasn't able to connect for some reason. I posted this question on the STAF forum itself, but they weren't much help.
esiegel
+3  A: 

Let's start with the basics: your application is I/O-bound. You are not suffering bad performance due to object allocation, memory, or CPU limits. Your application is running slowly because of disk access.

If you think you can improve file access, you might need to resort to lower-level programming using JNI. File access can be improved if you handle it more efficiently yourself, and that needs to be done at a lower level.

I am not sure that java.nio will give you the order-of-magnitude improvement you are looking for, although it might give you more freedom to do CPU/memory-intensive work while I/O is running.

The reason is that, basically, java.nio wraps the file reading with a selector, letting you be notified when a buffer is ready for use, which gives you the asynchronous behavior that might help your performance a bit. But reading the file itself is your bottleneck, and java.nio doesn't give you anything in that area.

So try it out first, but I wouldn't get my hopes up too high for it.

Yuval A
Truth. 10 GB in 3 minutes is roughly 55 MB/second (about 10,240 MB / 180 s). That's approaching the sequential read throughput of conventional platter-based disk drives. You might double or triple that in a really good situation, but without a RAM-based drive, that's about it.
Jherico
Jherico> Or "add an index" or similar.
Tom Hawtin - tackline
I don't follow. If you're I/O-bound, how does having an index help you? It doesn't sound like search is the application; it's more like log processing or indexing, where you're just going through the data sequentially.
Jherico
Update: Using NIO on smaller data (1 GB), it doesn't seem much faster; only marginally, as you said. I'm having a problem using it on the large dataset, though. I have the data split into several 50 MB files, and with NIO I memory-map each file and then decode it into a char buffer, like the example, but how do I unmap the file? I'm getting a Java out-of-heap-space error after about a minute of processing. I know MappedByteBuffers have a limit of 2 GB, but each of my files is much smaller.
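Roughly what I'm doing per file looks like this (the charset here is a placeholder, and the processing is elided):

    import java.io.*;
    import java.nio.*;
    import java.nio.channels.*;
    import java.nio.charset.*;

    // Sketch of the per-file mapping and decoding described above.
    public class MapAndDecode {
        static void process(File file) throws IOException {
            FileChannel channel = new FileInputStream(file).getChannel();
            MappedByteBuffer bytes =
                channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size());
            CharBuffer chars = Charset.forName("ISO-8859-1").decode(bytes);
            // ... walk 'chars' line by line ...
            channel.close(); // closing the channel does not release the mapping
        }
    }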
esiegel
+1  A: 

Using NIO, channels, byte buffers, and memory-mapped files will give you the best performance. It's about as close to the hardware as you are going to get. I had a similar problem where I had to parse over 6 million delimited lines of text (a 265 MB file), move the delimited columns around within each line, and then write it back out. Using NIO on 2002 hardware, it took 33 seconds to do this. The trick is to leave the data as bytes. You have one thread reading the data to extract a line, another thread to manipulate the line, and a third thread to write it back out.
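Very roughly, the structure was along these lines (queue and buffer sizes are arbitrary, the column shuffling is elided, and the names are made up for the sketch):

    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;
    import java.util.concurrent.ArrayBlockingQueue;
    import java.util.concurrent.BlockingQueue;

    // Three-stage pipeline: reader -> worker -> writer, each in its own thread,
    // handing raw byte[] lines through bounded queues so no stage sits idle
    // while another could be doing useful work.
    public class PipelineSketch {
        static final byte[] POISON = new byte[0]; // end-of-stream marker

        public static void main(final String[] args) throws Exception {
            final BlockingQueue<byte[]> parsed = new ArrayBlockingQueue<byte[]>(1024);
            final BlockingQueue<byte[]> output = new ArrayBlockingQueue<byte[]>(1024);

            Thread reader = new Thread() {
                public void run() {
                    try {
                        // Read bytes through a channel and split them into lines.
                        FileChannel in = new FileInputStream(args[0]).getChannel();
                        ByteBuffer buf = ByteBuffer.allocateDirect(1 << 20);
                        ByteArrayOutputStream line = new ByteArrayOutputStream();
                        while (in.read(buf) != -1) {
                            buf.flip();
                            while (buf.hasRemaining()) {
                                byte b = buf.get();
                                if (b == '\n') { parsed.put(line.toByteArray()); line.reset(); }
                                else line.write(b);
                            }
                            buf.clear();
                        }
                        if (line.size() > 0) parsed.put(line.toByteArray());
                        parsed.put(POISON);
                        in.close();
                    } catch (Exception e) { throw new RuntimeException(e); }
                }
            };

            Thread worker = new Thread() {
                public void run() {
                    try {
                        byte[] line;
                        while ((line = parsed.take()) != POISON) {
                            // Rearrange the delimited columns here, still as bytes.
                            output.put(line);
                        }
                        output.put(POISON);
                    } catch (Exception e) { throw new RuntimeException(e); }
                }
            };

            Thread writer = new Thread() {
                public void run() {
                    try {
                        // Write each processed line back out through a channel.
                        FileChannel out = new FileOutputStream(args[1]).getChannel();
                        byte[] line;
                        while ((line = output.take()) != POISON) {
                            out.write(ByteBuffer.wrap(line));
                            out.write(ByteBuffer.wrap(new byte[] { '\n' }));
                        }
                        out.close();
                    } catch (Exception e) { throw new RuntimeException(e); }
                }
            };

            reader.start(); worker.start(); writer.start();
        }
    }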

Javamann