ansaurus

Question

Deleting duplicate lines in a file using Java

Answer 1

+13 A:

Hmm... 40 megs seems small enough that you could build a Set of the lines and then print them all back out. This would be way, way faster than doing O(n²) I/O work.

It would be something like this (ignoring exceptions):

public void stripDuplicatesFromFile(String filename) {
    BufferedReader reader = new BufferedReader(new FileReader(filename));
    Set<String> lines = new HashSet<String>(10000); // maybe should be bigger
    String line;
    while ((line = reader.readLine()) != null) {
        lines.add(line);
    }
    reader.close();
    BufferedWriter writer = new BufferedWriter(new FileWriter(filename));
    for (String unique : lines) {
        writer.write(unique);
        writer.newLine();
    }
    writer.close();
}

If the order is important, you could use a LinkedHashSet instead of a HashSet. Since the elements are stored by reference, the overhead of an extra linked list should be insignificant compared to the actual amount of data.

Edit: As Workshop Alex pointed out, if you don't mind making a temporary file, you can simply print out the lines as you read them. This allows you to use a simple HashSet instead of LinkedHashSet. But I doubt you'd notice the difference on an I/O bound operation like this one.

Michael Myers 2009-06-15 13:18:08

that's the answer I was going to give

David Johnstone 2009-06-15 13:19:50

yea, 40 megs is nothing, read the whole thing into memory, dump it to a hashset to keep only unique lines, write it back to disk.

yx 2009-06-15 13:21:42

Depending on the questioner's requirements, you may need to keep track of the line number, because iterating over a HashSet will return the lines in a pretty arbitrary order.

Simon Nickerson 2009-06-15 13:25:12

You could initialize the hashset with a value like #lines / 0.75 because HashSet will make a new table and rehash everything if it reaches its default fill-grade of 75%.Another possibility would be to create the HashSet with a fillgrade of 1.0f (100%) and a size that is a bit bigger then your data-count -> "new HashSet(300000, 1.0f)". This way you can avoid expensive rehashing.

Philipp 2009-06-15 13:30:17

You could simplify this code by using readLines() and writeLines() from Commons IO's FileUtils, http://commons.apache.org/io/api-release/org/apache/commons/io/FileUtils.html. (I'm not sure whether that would affect the scalability though.)

Jonik 2009-06-15 13:32:20

Hmmm, I tried implementing this, but I get the error "java.lang.OutOfMemoryError: Java heap space". I tried increasing the HashSet size, but no good. Ideas? Thanks!

Monster 2009-06-15 13:33:40

Pass -Xmx64m (where 64 is the number of megabytes in the heap) to the program at startup, like "java -Xmx64m MyProgram" or "java -Xmx100m -jar MyJar.jar".

Michael Myers 2009-06-15 13:42:46

he will most likely need more then 64M of ram.why? 40MB us-ascii-test-file -> 80MB as strings + HashSet overhead + Object overhead + ....I'd go with 512 MB or so :)

Philipp 2009-06-15 13:55:35

Ah- but he's not storing duplicate lines, so it depends on how many duplicates there are. (You're likely right, though, and it doesn't hurt to have a larger allocation on a short-running program like this one.)

Michael Myers 2009-06-15 13:59:11

Ah, I set the xmx to 512 and it seemed to work. Great fix! Duplicates gone! Thanks guys!

Monster 2009-06-15 14:05:47

And as a final aside, I did end up using LinkedHashSet. While order isn't hugely important, it makes tracking things a lot easier. And the overhead is nil. Thanks again everyone!

Monster 2009-06-15 14:13:42

Answer 2

+2 A:

You could use Set in the Collections library to store unique, seen values as you read the file.

Set<String> uniqueStrings = new HashSet<String>();

// read your file, looping on newline, putting each line into variable 'thisLine'

    uniqueStrings.add(thisLine);

// finish read

for (String uniqueString:uniqueStrings) {
  // do your processing for each unique String
  // i.e. System.out.println(uniqueString);
}

Brabster 2009-06-15 13:18:23

Answer 3

+2 A:

Try a simple HashSet that stores the lines you have already read. Then iterate over the file. If you come across duplicates they are simply ignored (as a Set can only contain every element once).

Kevin D. 2009-06-15 13:19:18

you're better off with some sort of set rather than a map

David Johnstone 2009-06-15 13:20:34

That's why I already fixed it ;)

Kevin D. 2009-06-15 13:23:36

I've done something similar in Delphi once, although I had to write my own HashSet class to do this. The only drawback is that you need lots of memory with huge files, which is fine if you do this client-side but not on a server. Basically, the project that needed this managed to read a file of 500k lines and delete all duplicates within two minutes.

Workshop Alex 2009-06-15 13:44:50

However, I just read a line, checked if it was in the hash-set and if it wasn't, I would add it and write it to file. Otherwise, I'd just skip to the next line. That way, I'm not reading back from the hashset and best of all: I kept all lines in the same order.

Workshop Alex 2009-06-15 13:46:27

Answer 4

+2 A:

Something like this, perhaps:

BufferedReader in = ...;
Set<String> lines = new LinkedHashSet();
for (String line; (line = in.readLine()) != null;)
    lines.add(line); // does nothing if duplicate is already added
PrintWriter out = ...;
for (String line : lines)
    out.println(line);

LinkedHashSet keeps the insertion order, as opposed to HashSet which (while being slightly faster for lookup/insert) will reorder all lines.

gustafc 2009-06-15 13:20:46

Answer 5

+1 A:

The Hash Set approach is OK, but you can tweak it to not have to store all the Strings in memory, but a logical pointer to the location in the file so you can go back to read the actual value only in case you need it.

Another creative approach is to append to each line the number of the line, then sort all the lines, remove the duplicates (ignoring the last token that should be the number), and then sort again the file by the last token and striping it out in the output.

fortran 2009-06-15 13:21:39

Answer 6

A:

If you could use UNIX shell commands you could do something like the following:

for(i = line 0 to end)
{
    sed 's/\$i//2g' ; deletes all repeats
}

This would iterate through your whole file and only pass each unique occurrence once per sed call. This way you're not doing a bunch of searches you've done before.

samoz 2009-06-15 13:21:39

Answer 7

+1 A:

Read in the file, storing the line number and the line: O(n)
Sort it into alphabetical order: O(n log n)
Remove duplicates: O(n)
Sort it into its original line number order: O(n log n)

Simon Nickerson 2009-06-15 13:23:35

Answer 8

A:

There are two scalable solutions, where by scalable I mean disk and not memory based, depending whether the procedure should be stable or not, where by stable I mean that the order after removing duplicates is the same. if scalability isn't an issue, then simply use memory for the same sort of method.

For the non stable solution, first sort the file on the disk. This is done by splitting the file into smaller files, sorting the smaller chunks in memory, and then merging the files in sorted order, where the merge ignores duplicates.

The merge itself can be done using almost no memory, by comparing only the current line in each file, since the next line is guaranteed to be greater.

The stable solution is slightly trickier. First, sort the file in chunks as before, but indicate in each line the original line number. Then, during the "merge" don't bother storing the result, just the line numbers to be deleted.

Then copy the original file line by line, ignoring the line numbers you have stored above.

2009-06-15 13:25:17

Answer 9

+1 A:

If the order does not matter, the simplest way is shell scripting:

<infile sort | uniq > outfile

phihag 2009-06-15 13:26:08

Answer 10

A:

Does it matter in which order the lines come, and how many duplicates are you counting on seeing?

If not, and if you're counting on a lot of dupes (i.e. a lot more reading than writing) I'd also think about parallelizing the hashset solution, with the hashset as a shared resource.

mikek 2009-06-15 13:45:28

Not a bad idea, but since the input file is only 40 megabytes I don't think that it will be a problem.

Michael Myers 2009-06-15 13:47:33

I guess. But parallelizing stuff is phun! :3

mikek 2009-06-15 13:52:17

Answer 11

+8 A:

Okay, most answers are a bit silly and slow since it involves adding lines to some hashset or whatever and then moving it back from that set again. Let me show the most optimal solution in pseudocode:

Create a hashset for just strings.
Open the input file.
Open the output file.
while not EOF(input)
  Read Line.
  If not(Line in hashSet)
    Add Line to hashset.
    Write Line to output.
  End If.
End While.
Free hashset.
Close input.
Close output.

Please guys, don't make it more difficult than it needs to be. :-) Don't even bother about sorting, you don't need to.

Workshop Alex 2009-06-15 13:52:24

+1 for stating the bleeding obvious I should've seen when writing my answer. D'oh! :)

gustafc 2009-06-15 14:20:04

True; I was doing it without a temporary file, but it might be a little more efficient with one (no LinkedHashSet necessary). But I'd venture a guess that CPU isn't going to be the bottleneck anyway.

Michael Myers 2009-06-15 14:21:05

Er, my comment was directed at Workshop Alex, not gustafc.

Michael Myers 2009-06-15 14:22:15

Of course, instead of using an output file, you could output to an unsorted string list, in memory. Then, when you're done adding the input without duplicates, write the string list over the old input file. It does mean you'll be using twice as much memory than with other solutions, but it's still extremely fast.

Workshop Alex 2009-06-16 09:39:17

@Workshop Alex: That's basically what I did. Why do you say it uses twice as much memory?

Michael Myers 2009-06-16 14:18:19

That's because it stores the strings twice: once in the hash table and once in the string list. (Then again, chances are that both the hashset and string list only store references to strings, in which case it won't eat up that much.)

Workshop Alex 2009-06-16 21:15:37

Yes, they do store references. The extra overhead is probably not even enough to notice, at 8 bytes per unique string.

Michael Myers 2009-06-16 22:22:48

Simple calculation: 225k lines, times 8 for every reference makes 1.8 megabytes. With two lists, that doubles to 3.6 megabytes. Then again, if 90% are duplicates then you can reduce these number again by 90%...

Workshop Alex 2009-06-18 15:44:32

Answer 12

+1 A:

A similar approach

public void stripDuplicatesFromFile(String filename) {
    IOUtils.writeLines(
        new LinkedHashSet<String>(IOUtils.readLines(new FileInputStream(filename)),
        "\n", new FileOutputStream(filename + ".uniq"));
}

Peter Lawrey 2009-06-16 20:30:07

Shouldn't the latter FileInputStream actually be a FileOutputStream? Other than that, +1 for a simplicity and "knowing and using the libraries".

Jonik 2009-06-23 20:43:43

Also, it's worth mentioning that IOUtils is from Apache Commons IO (http://commons.apache.org/io/); that probably isn't obvious to every reader.

Jonik 2009-06-23 20:46:14

@Jonik, thank for you pointing out those two comments.

Peter Lawrey 2009-06-25 17:49:18

ansaurus

tags:

views:

answers:

Deleting duplicate lines in a file using Java

related questions