Hey everyone. I have multiple text files containing log entries that I need to parse later on. Each file is up to 1 MB in size, and I have approximately 10 files. Each line has the following format:

Timestamp\tData

I have to merge all the files and sort the entries by their timestamp value. There is no guarantee that the entries within a single file are in chronological order.

What would be the smartest approach? My code so far looks roughly like this:

List<FileEntry> oneBigList = new ArrayList<FileEntry>();
for (File file : logFiles) {
    BufferedReader reader = new BufferedReader(new FileReader(file));
    String line;
    while ((line = reader.readLine()) != null) {
        oneBigList.add(FileEntry.parse(line)); // split on '\t' into timestamp + data
    }
    reader.close(); // exception handling omitted for brevity
}
Collections.sort(oneBigList); // sorts by FileEntry.getTimestamp()
+1  A: 

If you are not sure your task will fit into available memory, you are better off inserting your parsed lines into a database table and letting the database worry about ordering the data (an index on the timestamp column will help :-)
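
A minimal sketch of the database route using plain JDBC (the table name, column types, and the FileEntry getters are my assumptions, not from the question; any JDBC-capable database will do):

import java.sql.*;
import java.util.List;

// Sketch: bulk-insert parsed entries and let the database order them by timestamp.
static void storeEntries(Connection conn, List<FileEntry> entries) throws SQLException {
    try (Statement stmt = conn.createStatement()) {
        stmt.execute("CREATE TABLE log_entries (ts BIGINT, data VARCHAR(1024))");
        stmt.execute("CREATE INDEX idx_ts ON log_entries (ts)");
    }
    try (PreparedStatement insert =
            conn.prepareStatement("INSERT INTO log_entries (ts, data) VALUES (?, ?)")) {
        for (FileEntry entry : entries) {
            insert.setLong(1, entry.getTimestamp());
            insert.setString(2, entry.getData());
            insert.addBatch(); // batched inserts keep the round-trips down
        }
        insert.executeBatch();
    }
    // Afterwards, SELECT ts, data FROM log_entries ORDER BY ts is cheap thanks to the index.
}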

If you are sure memory is no problem, I would use a TreeMap to do the sorting while adding the lines to it.

Make sure your FileEntry class implements hashCode(), equals() and Comparable according to your sort order.
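
A minimal sketch of such a FileEntry plus the TreeSet usage (assuming the timestamp is a plain numeric value; the parse format follows the Timestamp\tData layout from the question):

import java.util.TreeSet;

class FileEntry implements Comparable<FileEntry> {
    private final long timestamp;
    private final String data;

    FileEntry(long timestamp, String data) {
        this.timestamp = timestamp;
        this.data = data;
    }

    // Parses one "Timestamp\tData" line; adjust if the timestamp is a date string.
    static FileEntry parse(String line) {
        String[] parts = line.split("\t", 2);
        return new FileEntry(Long.parseLong(parts[0]), parts[1]);
    }

    public long getTimestamp() { return timestamp; }
    public String getData() { return data; }

    @Override
    public int compareTo(FileEntry other) {
        int byTime = Long.compare(timestamp, other.timestamp);
        // Tie-break on data: a TreeSet treats compareTo() == 0 as a duplicate and
        // would otherwise silently drop distinct entries sharing a timestamp.
        return byTime != 0 ? byTime : data.compareTo(other.data);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof FileEntry)) return false;
        FileEntry e = (FileEntry) o;
        return timestamp == e.timestamp && data.equals(e.data);
    }

    @Override
    public int hashCode() {
        return 31 * (int) (timestamp ^ (timestamp >>> 32)) + data.hashCode();
    }
}

Usage:

TreeSet<FileEntry> sorted = new TreeSet<FileEntry>();
// add FileEntry.parse(line) for every line of every file;
// iterating over `sorted` then yields the entries in chronological order.

Note that the tie-break in compareTo() matters here: a TreeSet deduplicates by compareTo(), so comparing on the timestamp alone would lose entries with equal timestamps.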

rsp
Yup, for 10 files of 1 MB each, a TreeMap should be plenty. Actually, a TreeSet, because the map functionality is not needed, is it?
seanizer
If you don't need the lookup access, a `TreeSet` will do fine, yes.
rsp
I used the TreeSet approach and it's working fine. A small benchmark shows no big difference between Collections.sort() and the TreeSet: 151 ms vs. 170 ms respectively (average of 10 attempts per approach) on 150k test entries, including file opening and reading.
f1sh