views: 2836
answers: 12

As part of a project I'm working on, I'd like to clean up a file I generate of duplicate line entries. These duplicates often won't occur near each other, however. I came up with a method of doing so in Java (which basically made a copy of the file, then used a nested while-loop to compare each line in one file with the rest of the other). The problem is that my generated file is pretty big and text heavy (about 225k lines of text, and around 40 megs). I estimate my current process would take 63 hours! This is definitely not acceptable.

I need an integrated solution for this, however, preferably in Java. Any ideas? Thanks!

+13  A: 

Hmm... 40 megs seems small enough that you could build a Set of the lines and then print them all back out. This would be way, way faster than doing O(n²) I/O work.

It would be something like this (ignoring exceptions):

import java.io.*;
import java.util.*;

public void stripDuplicatesFromFile(String filename) throws IOException {
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    Set<String> lines = new HashSet<String>(10000); // maybe should be bigger
    String line;
    while ((line = reader.readLine()) != null) {
        lines.add(line); // the Set silently drops duplicates
    }
    reader.close();
    // overwrite the original file with only the unique lines
    BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
    for (String unique : lines) {
        writer.write(unique);
        writer.newLine();
    }
    writer.close();
}

If the order is important, you could use a LinkedHashSet instead of a HashSet. Since the elements are stored by reference, the overhead of an extra linked list should be insignificant compared to the actual amount of data.

Edit: As Workshop Alex pointed out, if you don't mind making a temporary file, you can simply print out the lines as you read them. This allows you to use a simple HashSet instead of LinkedHashSet. But I doubt you'd notice the difference on an I/O bound operation like this one.

Michael Myers
that's the answer I was going to give
David Johnstone
yea, 40 megs is nothing, read the whole thing into memory, dump it to a hashset to keep only unique lines, write it back to disk.
yx
Depending on the questioner's requirements, you may need to keep track of the line number, because iterating over a HashSet will return the lines in a pretty arbitrary order.
Simon Nickerson
You could initialize the HashSet with a capacity of about #lines / 0.75, because a HashSet allocates a new table and rehashes everything once it reaches its default load factor of 75%. Another possibility would be to create the HashSet with a load factor of 1.0f (100%) and a size a bit bigger than your line count, e.g. new HashSet<String>(300000, 1.0f). Either way you avoid expensive rehashing.
Philipp
You could simplify this code by using readLines() and writeLines() from Commons IO's FileUtils, http://commons.apache.org/io/api-release/org/apache/commons/io/FileUtils.html. (I'm not sure whether that would affect the scalability though.)
Jonik
Hmmm, I tried implementing this, but I get the error "java.lang.OutOfMemoryError: Java heap space". I tried increasing the HashSet size, but no good. Ideas? Thanks!
Monster
Pass -Xmx64m (where 64 is the number of megabytes in the heap) to the program at startup, like "java -Xmx64m MyProgram" or "java -Xmx100m -jar MyJar.jar".
Michael Myers
He will most likely need more than 64 MB of RAM. Why? A 40 MB US-ASCII text file becomes roughly 80 MB as Java strings (two bytes per character), plus HashSet overhead, per-object overhead, and so on. I'd go with 512 MB or so :)
Philipp
Ah- but he's not storing duplicate lines, so it depends on how many duplicates there are. (You're likely right, though, and it doesn't hurt to have a larger allocation on a short-running program like this one.)
Michael Myers
Ah, I set the xmx to 512 and it seemed to work. Great fix! Duplicates gone! Thanks guys!
Monster
And as a final aside, I did end up using LinkedHashSet. While order isn't hugely important, it makes tracking things a lot easier. And the overhead is nil. Thanks again everyone!
Monster
+2  A: 

You could use a Set from the Collections framework to store unique, already-seen values as you read the file.

Set<String> uniqueStrings = new HashSet<String>();

// read your file line by line, putting each line into variable 'thisLine'
BufferedReader in = new BufferedReader(new FileReader(filename)); // 'filename' is your input file
String thisLine;
while ((thisLine = in.readLine()) != null) {
    uniqueStrings.add(thisLine);
}
in.close(); // finish read

for (String uniqueString : uniqueStrings) {
  // do your processing for each unique String
  // e.g. System.out.println(uniqueString);
}
Brabster
+2  A: 

Try a simple HashSet that stores the lines you have already read. Then iterate over the file. If you come across duplicates they are simply ignored (as a Set can only contain every element once).

Kevin D.
you're better off with some sort of set rather than a map
David Johnstone
That's why I already fixed it ;)
Kevin D.
I've done something similar in Delphi once, although I had to write my own HashSet class to do this. The only drawback is that you need lots of memory with huge files, which is fine if you do this client-side but not on a server. Basically, the project that needed this managed to read a file of 500k lines and delete all duplicates within two minutes.
Workshop Alex
However, I just read a line, checked if it was in the hash-set and if it wasn't, I would add it and write it to file. Otherwise, I'd just skip to the next line. That way, I'm not reading back from the hashset and best of all: I kept all lines in the same order.
Workshop Alex
+2  A: 

Something like this, perhaps:

BufferedReader in = ...;
Set<String> lines = new LinkedHashSet<String>();
for (String line; (line = in.readLine()) != null;)
    lines.add(line); // does nothing if duplicate is already added
PrintWriter out = ...;
for (String line : lines)
    out.println(line);

LinkedHashSet keeps the insertion order, as opposed to HashSet which (while being slightly faster for lookup/insert) will reorder all lines.

gustafc
+1  A: 

The hash set approach is OK, but you can tweak it so that it doesn't store all the strings in memory, only a logical pointer to each line's location in the file, so that you go back and read the actual value only when you need it.

Another creative approach is to append its line number to each line, sort all the lines, remove the duplicates (ignoring the last token, which is the number), then sort the file again by that last token, stripping it out in the output.
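
(A rough sketch of the first idea, for illustration only: keep a map from each line's hash code to the byte offsets of the unique lines already written, and seek back into the input file to confirm a real duplicate when hashes match. The class and method names are made up, and it assumes single-byte text, since RandomAccessFile.readLine doesn't decode multi-byte characters.)

import java.io.*;
import java.util.*;

public class OffsetDedup {
    public static void stripDuplicates(String inFile, String outFile) throws IOException {
        Map<Integer, List<Long>> seen = new HashMap<Integer, List<Long>>();
        RandomAccessFile in = new RandomAccessFile(inFile, "r");       // sequential pass
        RandomAccessFile lookback = new RandomAccessFile(inFile, "r"); // re-reads earlier lines
        BufferedWriter out = new BufferedWriter(new FileWriter(outFile));
        long offset = 0;
        String line;
        while ((line = in.readLine()) != null) {
            List<Long> candidates = seen.get(line.hashCode());
            if (candidates == null) {
                candidates = new ArrayList<Long>();
                seen.put(line.hashCode(), candidates);
            }
            boolean duplicate = false;
            for (long candidateOffset : candidates) {
                lookback.seek(candidateOffset);
                if (line.equals(lookback.readLine())) { // hash codes can collide, so verify
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                candidates.add(offset);
                out.write(line);
                out.newLine();
            }
            offset = in.getFilePointer();
        }
        out.close();
        lookback.close();
        in.close();
    }
}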

fortran
A: 

If you could use UNIX shell commands you could do something like the following:

for(i = line 0 to end)
{
    sed 's/\$i//2g'   # deletes all repeats of line $i
}

This would iterate through your whole file and only pass each unique occurrence once per sed call. This way you're not doing a bunch of searches you've done before.

samoz
+1  A: 
  • Read in the file, storing the line number and the line: O(n)
  • Sort it into alphabetical order: O(n log n)
  • Remove duplicates: O(n)
  • Sort it into its original line number order: O(n log n)
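
A sketch of these steps in Java, assuming the whole file still fits in memory; the class and field names are illustrative only:

import java.io.*;
import java.util.*;

public class SortDedup {
    private static class Entry {
        final String line;
        final int number;
        Entry(String line, int number) { this.line = line; this.number = number; }
    }

    public static void stripDuplicates(String inFile, String outFile) throws IOException {
        // 1. Read the file, storing the line number with each line: O(n)
        List<Entry> entries = new ArrayList<Entry>();
        BufferedReader reader = new BufferedReader(new FileReader(inFile));
        String line;
        for (int n = 0; (line = reader.readLine()) != null; n++) {
            entries.add(new Entry(line, n));
        }
        reader.close();

        // 2. Sort into alphabetical order: O(n log n). The sort is stable, so
        //    equal lines keep their original relative order.
        Collections.sort(entries, new Comparator<Entry>() {
            public int compare(Entry a, Entry b) { return a.line.compareTo(b.line); }
        });

        // 3. Remove duplicates, which are now adjacent: O(n)
        List<Entry> unique = new ArrayList<Entry>();
        String previous = null;
        for (Entry e : entries) {
            if (!e.line.equals(previous)) {
                unique.add(e);
                previous = e.line;
            }
        }

        // 4. Sort back into original line-number order and write out: O(n log n)
        Collections.sort(unique, new Comparator<Entry>() {
            public int compare(Entry a, Entry b) { return a.number - b.number; }
        });
        BufferedWriter writer = new BufferedWriter(new FileWriter(outFile));
        for (Entry e : unique) {
            writer.write(e.line);
            writer.newLine();
        }
        writer.close();
    }
}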
Simon Nickerson
A: 

There are two scalable solutions, where by scalable I mean disk-based rather than memory-based, depending on whether the procedure should be stable or not, where by stable I mean that the order after removing duplicates is the same as in the original. If scalability isn't an issue, then simply use memory for the same sort of method.

For the non stable solution, first sort the file on the disk. This is done by splitting the file into smaller files, sorting the smaller chunks in memory, and then merging the files in sorted order, where the merge ignores duplicates.

The merge itself can be done using almost no memory, by comparing only the current line in each file, since the next line is guaranteed to be greater.

The stable solution is slightly trickier. First, sort the file in chunks as before, but indicate in each line the original line number. Then, during the "merge" don't bother storing the result, just the line numbers to be deleted.

Then copy the original file line by line, ignoring the line numbers you have stored above.
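
A compact sketch of the non-stable variant described above; the chunk size, temp-file handling and names are arbitrary choices for illustration:

import java.io.*;
import java.util.*;

public class ExternalDedup {
    private static final int LINES_PER_CHUNK = 100000; // tune to available memory

    public static void dedupe(String inFile, String outFile) throws IOException {
        // Phase 1: split the input into sorted chunk files
        List<File> chunks = new ArrayList<File>();
        BufferedReader in = new BufferedReader(new FileReader(inFile));
        List<String> buffer = new ArrayList<String>(LINES_PER_CHUNK);
        String line;
        while ((line = in.readLine()) != null) {
            buffer.add(line);
            if (buffer.size() == LINES_PER_CHUNK) {
                chunks.add(writeSortedChunk(buffer));
                buffer.clear();
            }
        }
        if (!buffer.isEmpty()) {
            chunks.add(writeSortedChunk(buffer));
        }
        in.close();

        // Phase 2: merge the sorted chunks; duplicates arrive adjacently, so
        // comparing against the previously written line is enough to skip them.
        List<BufferedReader> readers = new ArrayList<BufferedReader>();
        List<String> heads = new ArrayList<String>(); // current line of each chunk
        for (File chunk : chunks) {
            BufferedReader r = new BufferedReader(new FileReader(chunk));
            readers.add(r);
            heads.add(r.readLine());
        }
        BufferedWriter out = new BufferedWriter(new FileWriter(outFile));
        String previous = null;
        while (true) {
            int smallest = -1; // index of the chunk whose current line sorts first
            for (int i = 0; i < heads.size(); i++) {
                if (heads.get(i) != null
                        && (smallest == -1 || heads.get(i).compareTo(heads.get(smallest)) < 0)) {
                    smallest = i;
                }
            }
            if (smallest == -1) {
                break; // all chunks exhausted
            }
            String candidate = heads.get(smallest);
            if (!candidate.equals(previous)) {
                out.write(candidate);
                out.newLine();
                previous = candidate;
            }
            heads.set(smallest, readers.get(smallest).readLine());
        }
        out.close();
        for (BufferedReader r : readers) {
            r.close();
        }
        for (File chunk : chunks) {
            chunk.delete();
        }
    }

    private static File writeSortedChunk(List<String> lines) throws IOException {
        Collections.sort(lines);
        File chunk = File.createTempFile("dedup-chunk", ".txt");
        BufferedWriter w = new BufferedWriter(new FileWriter(chunk));
        for (String s : lines) {
            w.write(s);
            w.newLine();
        }
        w.close();
        return chunk;
    }
}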

+1  A: 

If the order does not matter, the simplest way is shell scripting:

<infile sort | uniq > outfile
phihag
A: 

Does it matter in which order the lines come, and how many duplicates are you counting on seeing?

If not, and if you're counting on a lot of dupes (i.e. a lot more reading than writing) I'd also think about parallelizing the hashset solution, with the hashset as a shared resource.

mikek
Not a bad idea, but since the input file is only 40 megabytes I don't think that it will be a problem.
Michael Myers
I guess. But parallelizing stuff is phun! :3
mikek
+8  A: 

Okay, most answers are a bit silly and slow since they involve adding lines to some hashset or whatever and then moving everything back out of that set again. Let me show the optimal solution in pseudocode:

Create a hashset for just strings.
Open the input file.
Open the output file.
while not EOF(input)
  Read Line.
  If not(Line in hashSet)
    Add Line to hashset.
    Write Line to output.
  End If.
End While.
Free hashset.
Close input.
Close output.

Please guys, don't make it more difficult than it needs to be. :-) Don't even bother with sorting; you don't need to.
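
For reference, a minimal Java rendering of that pseudocode; file names are placeholders, and Set.add already reports whether the line was new, so no separate lookup is needed:

import java.io.*;
import java.util.*;

public class StreamingDedup {
    public static void stripDuplicates(String inFile, String outFile) throws IOException {
        Set<String> seen = new HashSet<String>();
        BufferedReader in = new BufferedReader(new FileReader(inFile));
        BufferedWriter out = new BufferedWriter(new FileWriter(outFile));
        String line;
        while ((line = in.readLine()) != null) {
            if (seen.add(line)) { // add() returns false if the line was already present
                out.write(line);
                out.newLine();
            }
        }
        out.close();
        in.close();
    }
}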

Workshop Alex
+1 for stating the bleeding obvious I should've seen when writing my answer. D'oh! :)
gustafc
True; I was doing it without a temporary file, but it might be a little more efficient with one (no LinkedHashSet necessary). But I'd venture a guess that CPU isn't going to be the bottleneck anyway.
Michael Myers
Er, my comment was directed at Workshop Alex, not gustafc.
Michael Myers
Of course, instead of using an output file, you could output to an unsorted string list in memory. Then, when you're done adding the input without duplicates, write the string list over the old input file. It does mean you'll be using twice as much memory as with other solutions, but it's still extremely fast.
Workshop Alex
@Workshop Alex: That's basically what I did. Why do you say it uses twice as much memory?
Michael Myers
That's because it stores the strings twice: once in the hash table and once in the string list. (Then again, chances are that both the hashset and string list only store references to strings, in which case it won't eat up that much.)
Workshop Alex
Yes, they do store references. The extra overhead is probably not even enough to notice, at 8 bytes per unique string.
Michael Myers
Simple calculation: 225k lines, times 8 bytes for every reference, makes 1.8 megabytes. With two lists, that doubles to 3.6 megabytes. Then again, if 90% are duplicates you can reduce those numbers by 90% again...
Workshop Alex
+1  A: 

A similar approach

// IOUtils is from Apache Commons IO
public void stripDuplicatesFromFile(String filename) throws IOException {
    IOUtils.writeLines(
        new LinkedHashSet<String>(IOUtils.readLines(new FileInputStream(filename))),
        "\n", new FileOutputStream(filename + ".uniq"));
}
Peter Lawrey
Shouldn't the latter FileInputStream actually be a FileOutputStream? Other than that, +1 for a simplicity and "knowing and using the libraries".
Jonik
Also, it's worth mentioning that IOUtils is from Apache Commons IO (http://commons.apache.org/io/); that probably isn't obvious to every reader.
Jonik
@Jonik, thank you for pointing out those two things.
Peter Lawrey