My program receives large CSV files and transforms them into XML files. To improve performance, I would like to split these files into smaller segments of, for example, 500 lines. What Java libraries are available for splitting text files?

+2  A: 

What do you intend to do with the data?

If it is just record-by-record processing, then event-oriented (SAX or StAX) parsing will be the way to go. For record-by-record processing, an existing "pipeline" toolkit may also be applicable.

You can pre-process your file with a splitter function like this one or this Splitter.java.
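For illustration, a minimal sketch of such a splitter using only the JDK might look like the following. The class name `LineSplitter`, the `part-00000.csv` naming scheme, and the UTF-8 encoding are assumptions made for this example, not something taken from the linked code. It also assumes one record per line, so quoted CSV fields that contain line breaks would be cut incorrectly, and a header row is not repeated in each chunk.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: splits a text file into chunks of N lines each.
final class LineSplitter {

    /**
     * Writes the lines of source into files of at most linesPerChunk lines,
     * named part-00000.csv, part-00001.csv, ... inside outputDir.
     * Returns the chunk files in order.
     */
    static List<Path> split(Path source, Path outputDir, int linesPerChunk) throws IOException {
        if (linesPerChunk <= 0) {
            throw new IllegalArgumentException("linesPerChunk must be positive");
        }
        Files.createDirectories(outputDir);
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            BufferedWriter writer = null;
            try {
                int linesInChunk = 0;
                String line;
                while ((line = reader.readLine()) != null) {
                    // Start a new chunk file on the first line and whenever the current one is full.
                    if (writer == null || linesInChunk == linesPerChunk) {
                        if (writer != null) {
                            writer.close();
                        }
                        Path chunk = outputDir.resolve(String.format("part-%05d.csv", chunks.size()));
                        chunks.add(chunk);
                        writer = Files.newBufferedWriter(chunk, StandardCharsets.UTF_8);
                        linesInChunk = 0;
                    }
                    writer.write(line);
                    writer.newLine();
                    linesInChunk++;
                }
            } finally {
                if (writer != null) {
                    writer.close();
                }
            }
        }
        return chunks;
    }
}
```

Calling `LineSplitter.split(Paths.get("input.csv"), Paths.get("chunks"), 500)` (both paths are placeholders) would produce files of 500 lines each, with the last one holding the remainder. Only one line is held in memory at a time, so the size of the input should not matter much.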

VonC
+3  A: 

I don't understand what you'd gain by splitting the CSV file into smaller ones. With Java, you can read and process the file as you go; you don't have to read it all at once...
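As a rough sketch of what "process as you go" could look like, the example below reads the CSV one line at a time and emits XML through the JDK's StAX `XMLStreamWriter`, so neither the input nor the output is ever held in memory in full. The class name `CsvToXmlStreamer` and the element names `records`, `record`, and `field` are made up for this example, and the naive `split(",")` assumes there are no quoted fields containing commas or line breaks.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical helper: converts CSV to XML one record at a time.
final class CsvToXmlStreamer {

    static void convert(Reader csv, Writer xml) throws IOException, XMLStreamException {
        BufferedReader reader = new BufferedReader(csv);
        XMLStreamWriter out = XMLOutputFactory.newInstance().createXMLStreamWriter(xml);
        try {
            out.writeStartDocument();
            out.writeStartElement("records");      // assumed root element name
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue;                      // skip blank lines
                }
                out.writeStartElement("record");   // assumed per-line element name
                // Naive split: does not understand quoted fields.
                for (String field : line.split(",", -1)) {
                    out.writeStartElement("field");
                    out.writeCharacters(field);    // special characters such as '<' are escaped
                    out.writeEndElement();
                }
                out.writeEndElement();
            }
            out.writeEndElement();
            out.writeEndDocument();
            out.flush();
        } finally {
            out.close();                           // closes the StAX writer, not the underlying Writer
        }
    }
}
```

In practice you would pass a `FileReader`/`FileWriter` pair (or their `Files.newBufferedReader`/`newBufferedWriter` equivalents) and close them yourself. For real CSV with quoting, a dedicated CSV parser would be a safer choice than `split`.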

Stephane Grenier
I use commercial B2B translation software to transform the CSV file into XML, and this software does not handle large files very well...
Otavio
How large are your files? I've seen Java apps handle files with millions of lines without skipping a beat. It just depends on how they're coded...
Stephane Grenier
A: 

How are you planning on distributing the work once the files have been split?

I have done something similar with GridGain, a grid computing framework that lets you execute tasks on a grid of computers.

With this in hand, you can use a cache provider such as JBoss Cache to distribute the file to multiple nodes, then specify a start and end line number and process that range. This is outlined in the following GridGain example: http://www.gridgainsystems.com/wiki/display/GG15UG/Affinity+MapReduce+with+JBoss+Cache

Alternatively you could look at something like Hadoop and the Hadoop File System for moving the file between different nodes.

The same concept could be applied on your local machine by loading the file into a cache and then assigning certain "chunks" of the file to be worked on by separate threads. Grid computing is really only worthwhile for very large problems, or to provide some level of scalability transparently to your solution. You might need to watch out for IO bottlenecks and locks, but a simple thread pool into which you dispatch "jobs" after the file is split could work.
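A minimal sketch of the local thread-pool variant, assuming a fixed-size `ExecutorService` and a caller-supplied job per chunk, might look like this. The class name `ChunkedProcessor` and its `process` signature are hypothetical. A single thread reads the file, which keeps the IO sequential, and only the per-chunk work runs in parallel.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper: reads a file in chunks and hands each chunk to a thread pool.
final class ChunkedProcessor {

    /** Applies job to each chunk of linesPerChunk lines; results are returned in file order. */
    static <R> List<R> process(Path source, int linesPerChunk, int threads,
                               Function<List<String>, R> job)
            throws IOException, InterruptedException, ExecutionException {
        if (linesPerChunk <= 0 || threads <= 0) {
            throw new IllegalArgumentException("linesPerChunk and threads must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<R>> futures = new ArrayList<>();
            try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
                List<String> chunk = new ArrayList<>(linesPerChunk);
                String line;
                while ((line = reader.readLine()) != null) {
                    chunk.add(line);
                    if (chunk.size() == linesPerChunk) {
                        futures.add(pool.submit(taskFor(chunk, job)));
                        chunk = new ArrayList<>(linesPerChunk);
                    }
                }
                if (!chunk.isEmpty()) {
                    futures.add(pool.submit(taskFor(chunk, job))); // trailing partial chunk
                }
            }
            // Wait for every job; Future.get() rethrows job failures as ExecutionException.
            List<R> results = new ArrayList<>(futures.size());
            for (Future<R> future : futures) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    private static <R> Callable<R> taskFor(List<String> batch, Function<List<String>, R> job) {
        return () -> job.apply(batch);
    }
}
```

Note that this sketch queues chunks as fast as it can read them, so if the jobs are much slower than the reader, most of the file may end up in memory at once. A bounded queue or a semaphore around `submit` would be one way to limit that.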

Aidos