views: 90
answers: 2

What is the best way to remove duplicate lines from large .txt files, 1 GB and larger?

Because removing adjacent duplicates is simple, we can reduce this problem to just sorting the file.

Assume that we can't load the whole file into RAM because of its size.

Right now I'm waiting to retrieve all records from an SQL table with one unique-indexed field (I loaded the file lines into the table earlier), and I'm wondering whether there is a way to speed this up.
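For illustration, here is a minimal Python sketch of the sort-then-drop-adjacent-duplicates idea (an external merge sort): sort RAM-sized chunks, spill them to temporary files, and skip repeated lines while merging. The file names and chunk size are assumptions, not part of the question, and the sketch assumes every line ends with a newline.

    import heapq
    import itertools
    import os
    import tempfile

    def dedupe_large_file(src_path, dst_path, max_lines_per_chunk=1_000_000):
        chunk_paths = []
        try:
            # Phase 1: sort RAM-sized chunks and spill them to temp files.
            with open(src_path, "r", encoding="utf-8") as src:
                while True:
                    lines = list(itertools.islice(src, max_lines_per_chunk))
                    if not lines:
                        break
                    lines.sort()
                    fd, path = tempfile.mkstemp(text=True)
                    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
                        tmp.writelines(lines)
                    chunk_paths.append(path)

            # Phase 2: k-way merge; duplicates are now adjacent, so write
            # only lines that differ from the previous line written.
            chunks = [open(p, "r", encoding="utf-8") for p in chunk_paths]
            try:
                with open(dst_path, "w", encoding="utf-8") as dst:
                    previous = None
                    for line in heapq.merge(*chunks):
                        if line != previous:
                            dst.write(line)
                            previous = line
            finally:
                for f in chunks:
                    f.close()
        finally:
            for p in chunk_paths:
                os.remove(p)

    dedupe_large_file("input.txt", "deduped.txt")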

A: 

Read the file two bytes at a time. If those two bytes are a newline (\r\n), set a flag that you have a new line. Now read the next two bytes; if they are also a newline, keep the flag set but delete that newline (by "delete" I mean omit writing it to your temporary file). If you encounter yet another newline it gets deleted as well; if not, reset the flag. Then copy the contents of your temporary file over the original and you're done.

You can also read one byte at a time if you're looking for a single (\n), or you could read 1 KB of the file at a time and do these operations in memory (this would be much faster).
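A minimal Python sketch of what this answer describes: stream the file in 1 KB blocks and omit newlines that immediately follow another newline, i.e. collapse empty lines (as the comments below point out, this is not the same as removing duplicate lines). It assumes Unix-style \n line endings; the file name and block size are illustrative.

    import shutil

    def collapse_blank_lines(path, block_size=1024):
        tmp_path = path + ".tmp"               # temporary file, moved back at the end
        with open(path, "rb") as src, open(tmp_path, "wb") as tmp:
            just_saw_newline = False           # the "flag" from the answer above
            while True:
                block = src.read(block_size)   # read 1 KB at a time
                if not block:
                    break
                out = bytearray()
                for byte in block:
                    if byte == ord("\n"):
                        if just_saw_newline:
                            continue           # omit the repeated newline
                        just_saw_newline = True
                    else:
                        just_saw_newline = False
                    out.append(byte)
                tmp.write(out)
        shutil.move(tmp_path, path)            # replace the original with the temp file

    collapse_blank_lines("input.txt")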

ExtremeCoder
I mean deleting duplicate lines (whole lines with their content), not "double newline sequences".
killer_PL
What do you think an empty line is?
ExtremeCoder
+1  A: 

You could try a Bloom filter. While you may get some false positives (though you can get arbitrarily close to 0% at the cost of more processing), it should be pretty fast, as you don't need to compare or even do a log(n) search for each line you see.
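For illustration, a minimal Bloom-filter sketch in Python that flags probable duplicate lines in one streaming pass. The bit-array size and hash count are illustrative assumptions; in practice you would choose them from the expected line count and the false-positive rate you can accept, and verify the flagged lines with an exact check in a second pass.

    import hashlib

    class BloomFilter:
        def __init__(self, num_bits=64 * 1024 * 1024 * 8, num_hashes=7):
            self.num_bits = num_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(num_bits // 8)    # 64 MB bit array here

        def _positions(self, item):
            # Derive k bit positions from salted SHA-256 digests of the line.
            for i in range(self.num_hashes):
                digest = hashlib.sha256(i.to_bytes(4, "big") + item).digest()
                yield int.from_bytes(digest[:8], "big") % self.num_bits

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def might_contain(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    def find_possible_duplicates(path):
        # Lines the filter claims to have seen are only *possible* duplicates
        # (false positives happen), so collect them for a later exact check
        # instead of dropping them right away.
        bloom = BloomFilter()
        suspects = []
        with open(path, "rb") as f:
            for line in f:
                if bloom.might_contain(line):
                    suspects.append(line)
                else:
                    bloom.add(line)
        return suspects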

Paul Rubel
With a Bloom filter or other hash functions you can find the possible duplicates, then compare and remove them in a later pass.
Floyd