I've got a CSV file that I'm processing with the opencsv library, so I can read in each line. The particular transformation I need requires me to sort the file first, before I run through it with the main portion of my Java program.

e.g.

5423, blah2, blah
5323, blah3, blah
5423, blah4, blah
5444, blah5, blah
5423, blah6, blah

should become

5323, blah3, blah
5423, blah2, blah
5423, blah4, blah
5423, blah6, blah
5444, blah5, blah

etc..

The reason I need to do this is that I'm combining all rows with the same id and outputting them to a new file.

Anything wrong with:

  1. Read each line of the csv with the opencsv library

  2. Add them to a 2 dimensional array

  3. Run some sort of sorting on this

  4. Loop through sorted array and output to file.

Any other ideas on this and what is the best way to sort the data?

Bit rusty on my Java.

UPDATE: To Clarify on the final output

It would look like:

5323, blah3, blah
5423, blah2!!blah4!!blah6, blah
5444, blah5, blah

This is a very simplified version of what I'm doing. It actually is needed for multi option fields in a JBase system. This is the requested file format.

There are over 100,000 lines in the original file.

This will be run more than once, and how fast it runs is important to me.

+5  A: 

For the updated requirement, I would strongly suggest using a Multimap from Google Collections. Your code would look like:

CSVReader reader = ...;
CSVWriter writer = ...;

Multimap<String, String> results = TreeMultimap.create();

// read the file
String[] line;
while ((line = reader.readNext()) != null) {
    results.put(line[0], line[1]);
}

// output the file
Map<String, Collection<String>> mapView = results.asMap();
for (Map.Entry<String, Collection<String>> entry : mapView.entrySet()) {
    String[] nextLine = new String[2];
    nextLine[0] = entry.getKey();
    nextLine[1] = formatCollection(entry.getValue());
    writer.writeNext(nextLine);
}

You need to use "blah\n" as your line ender. If you care about speed, but not so much about having the entries sorted, you should benchmark against HashMultimap as well.
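
The formatCollection call above is not defined in the snippet. A minimal sketch of what it might look like, assuming the values should be joined with the "!!" separator from the question's sample output (the class name and the use of String.join from Java 8 are my assumptions, not part of the original answer):

```java
import java.util.Arrays;
import java.util.Collection;

public class FormatCollectionSketch {
    // Hypothetical helper: joins the grouped values with the "!!" separator
    // shown in the question's expected output.
    public static String formatCollection(Collection<String> values) {
        return String.join("!!", values);
    }

    public static void main(String[] args) {
        // prints blah2!!blah4!!blah6
        System.out.println(formatCollection(Arrays.asList("blah2", "blah4", "blah6")));
    }
}
```

With a TreeMultimap the values for each key come back sorted, so the joined string is in sorted order rather than file order.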

My previous answer:

The most straightforward way is to use the sort command in *nix (e.g. Linux and Mac OS), like

sort -n myfile.csv

Windows has a sort command as well, but it sorts the lines alphabetically (i.e. '13,' lines would be placed before '5,' lines).

However, there is nothing wrong with the suggested solution. Instead of constructing the array and sorting it, you can also just use TreeSet.

EDIT: adding a note about Windows.

notnoop
+1 I like the sort command call approach.
ATorras
+1 TreeSet. Could be SortedSet as well.
Tom
SortedSet is the interface. TreeSet is the implementation! The Collections API has some gems that aren't commonly known or used unfortunately.
notnoop
This would work...unless of course he doesn't use a *nix machine.
Peter
Apparently Windows (at least up to XP) has a sort command as well: http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/sort.mspx?mfr=true
notnoop
This is good if you want to do it on the command line... but bad if you want to do it programmatically. Also, doesn't this approach use a lexicographic sort... which may not have the desired results?
Tom
+1  A: 

Hi,

Have you tried using Collections.sort() and a Comparator instance?
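
A rough sketch of that idea, assuming each row is the String[] that opencsv hands back and that the first column is a numeric id (the class and method names here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

public class SortRowsSketch {
    // Compares rows by the numeric value of the first column (assumed to be the id).
    static final Comparator<String[]> BY_ID = new Comparator<String[]>() {
        @Override
        public int compare(String[] a, String[] b) {
            return Long.compare(Long.parseLong(a[0].trim()), Long.parseLong(b[0].trim()));
        }
    };

    // Returns a sorted copy and leaves the input list untouched.
    public static List<String[]> sortById(List<String[]> rows) {
        List<String[]> sorted = new ArrayList<String[]>(rows);
        // Collections.sort is stable, so rows sharing an id keep their file order.
        Collections.sort(sorted, BY_ID);
        return sorted;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<String[]>();
        rows.add(new String[] {"5423", "blah2", "blah"});
        rows.add(new String[] {"5323", "blah3", "blah"});
        rows.add(new String[] {"5423", "blah4", "blah"});
        for (String[] row : sortById(rows)) {
            System.out.println(row[0] + ", " + row[1] + ", " + row[2]);
        }
    }
}
```

Comparing parsed numbers rather than strings avoids the lexicographic ordering problem mentioned for the command-line sort.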

Regards.

ATorras
sorting seems unnecessary for the end goal.
Tom
Apart from of course the "Run some sort of sorting on this" part of the question.
Tom
Oh dear... two Toms is going to get confusing :-P. @other Tom: I realize the OP asked for a way to sort, but he also explained what he is trying to do and asked for other ideas... I think providing an O(n) solution that scales better, instead of an O(n log n) one, falls into that category :-). I did not -1 this response because it is technically correct and answers part of the question. However, I did not +1 it, because it is most likely not the best solution, unless the OP is dealing with an extremely small data set.
Tom
I agree with Tom (hah!) - I keep seeing people wanting to sort lists when they really want to do something else. Sure - sorting can do this - but it's overkill. If this is just quick and dirty work, then getting the code written fast is more important than how fast it runs, but you've got to walk into it with a big comment saying 'this is sloppy and totally non-optimized, but it gets the job done'. Then if profiling turns that up as a hot spot, go bang on it.
Kevin Day
Well, the reason I want to sort it is so I can group the lines with the same code number, then loop through the array/tree/list and combine the rows with the same codes into one row. By combine I mean one of the other columns will have a combined string, e.g. "blah2!!blah4!!blah6", and a single line for code 5423. It seemed to me that sorting them was the best way to go about it.
Derek Organ
A: 

You could just use a one-dimensional ArrayList (or other collection) and have Java sort it using the Collections.sort method. Everything else you described sounds pretty standard, though.

Peter
sorting seems unnecessary for the end goal.
Tom
A: 

You say you need to "sort" the items, but your description sounds as if you need to group them. This could be done many ways; you might want to look into multimaps such as those offered by google collections; or you could simply create a

HashMap<Long, List<String>>

and place each line into the relevant list as you read it. My preference in cases like this is two passes through the file, once to add a new ArrayList to each key, and a second pass to add each string to the list, but it's probably more efficient (just less simple) to use a single pass, wherein you check to see if the list is already in the map.
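
The single-pass variant could look roughly like this. It is only a sketch: it assumes each row is a String[] with the id in column 0 and the value to collect in column 1, and the class and method names are invented for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupByIdSketch {
    // Groups the second column of each row under the numeric id in the first column.
    public static Map<Long, List<String>> group(List<String[]> rows) {
        Map<Long, List<String>> groups = new HashMap<Long, List<String>>();
        for (String[] row : rows) {
            Long id = Long.valueOf(row[0].trim());
            List<String> values = groups.get(id);
            if (values == null) {
                // First time this id is seen: create its list.
                values = new ArrayList<String>();
                groups.put(id, values);
            }
            values.add(row[1].trim());
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<String[]>();
        rows.add(new String[] {"5423", "blah2", "blah"});
        rows.add(new String[] {"5323", "blah3", "blah"});
        rows.add(new String[] {"5423", "blah4", "blah"});
        rows.add(new String[] {"5423", "blah6", "blah"});
        // prints [blah2, blah4, blah6]
        System.out.println(group(rows).get(5423L));
    }
}
```

A HashMap does not iterate in key order, so if the output file has to be ordered by id, swapping in a TreeMap would be one option.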

Carl Manaster
A: 

It sounds like you don't need to sort the entire thing. I am not sure how many lines you are going to have, but it seems like you could use some sort of hash based scheme. You can think of your files as buckets in a hashmap and after reading each line, determine which file it belongs to. Then you can further process each file. There are a couple ways you can do this.

  • If you won't have a lot of "keys", you can actually just keep all the keys in memory as keys in a hash map of string => string (A map that maps the key to filename the line belongs in).

  • If there are too many possible keys to keep in memory, you can try to bucket the lines into different files to reduce the size of each file. Then you can load each file into memory in turn, which would allow you to dump its lines into a collection and sort. Or possibly use the first scheme I mentioned.

Does this make sense? I can probably elaborate more if you are confused. I imagine your keys will be made by somehow combining all the columns of your csv line.

This approach will be more scalable if your files get really big. You don't want to depend on having the entire file in memory, and sorting takes O(nlogn) time, whereas in theory, the hashing scheme is just O(n).
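
To make the bucketing idea a bit more concrete, here is a small sketch. It uses in-memory lists as stand-ins for the bucket files (a real version would append each row to a file chosen the same way), and the names and the choice of hashing only the first column are assumptions for the example:

```java
import java.util.ArrayList;
import java.util.List;

public class BucketSketch {
    // Picks a bucket index in [0, bucketCount) from the key's hash code.
    // floorMod keeps the result non-negative even for negative hash codes.
    public static int bucketFor(String key, int bucketCount) {
        return Math.floorMod(key.trim().hashCode(), bucketCount);
    }

    // Distributes rows into buckets by the key in column 0.
    // Each inner list stands in for one bucket file.
    public static List<List<String[]>> partition(List<String[]> rows, int bucketCount) {
        List<List<String[]>> buckets = new ArrayList<List<String[]>>();
        for (int i = 0; i < bucketCount; i++) {
            buckets.add(new ArrayList<String[]>());
        }
        for (String[] row : rows) {
            buckets.get(bucketFor(row[0], bucketCount)).add(row);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String[]> rows = new ArrayList<String[]>();
        rows.add(new String[] {"5423", "blah2", "blah"});
        rows.add(new String[] {"5323", "blah3", "blah"});
        rows.add(new String[] {"5423", "blah4", "blah"});
        List<List<String[]>> buckets = partition(rows, 4);
        for (int i = 0; i < buckets.size(); i++) {
            System.out.println("bucket " + i + ": " + buckets.get(i).size() + " row(s)");
        }
    }
}
```

Since all rows with the same key land in the same bucket, each bucket can then be grouped or sorted on its own without holding the whole file in memory.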

Tom
A: 

FlatPack is great for reading in files like that and sorting them. It also has options for exporting a data set to a file.

Owen
+1  A: 

If you are only interested in sorting on the id, and aren't bothered about the ordering within that id, you could simply combine a MultiValueMap from Commons Collections with a TreeMap:

MultiValueMap m = MultiValueMap.decorate(new TreeMap());

m.put(2, "B");
m.put(3, "Y");
m.put(1, "F");
m.put(1, "E");
m.put(2, "K");
m.put(4, "Q");
m.put(3, "I");
m.put(1, "X");

for(Iterator iter = m.entrySet().iterator(); iter.hasNext(); ) {
    final Map.Entry entry = (Map.Entry)iter.next();
    System.out.println(entry.getKey() + ": " + entry.getValue());
}

Running this gives:

1: [F, E, X]
2: [B, K]
3: [Y, I]
4: [Q]

There is an overloaded decorate method which lets you specify the collection type to use in the MultiValueMap. You could use this if you need to sort within the ID.

A_M