views: 504
answers: 5

What is the most memory-efficient way to remove duplicate lines in a large text file using C++?

Let me clarify: I'm not asking for code, just the best method. The duplicate lines are not guaranteed to be adjacent. I realize that an approach optimized for minimal memory usage will be slower, but that is my restriction, as the files are far too large.

A: 

To minimize memory usage:

If you have unlimited (or very fast) disk I/O, you could write each line to its own file, with the filename being the hash plus some identifier indicating order (or no identifier, if order is irrelevant). In this manner, you use the file system as extended memory. This should be much faster than re-scanning the whole file for each line.
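
A minimal sketch of this hash-named-file idea, assuming C++17 and std::hash; the input, output, and scratch-directory names are placeholders chosen for illustration, not anything specified in the answer:

    // Sketch: use the file system itself as the "seen" set. Each line is hashed
    // and appended to a small file named after the hash inside a scratch
    // directory; a line already present in its hash file is a duplicate.
    // "input.txt", "output.txt", and "dedup_scratch" are placeholder names.
    #include <filesystem>
    #include <fstream>
    #include <functional>
    #include <string>

    namespace fs = std::filesystem;

    int main() {
        const fs::path scratch = "dedup_scratch";   // placeholder scratch directory
        fs::create_directories(scratch);

        std::ifstream in("input.txt");              // placeholder input file
        std::ofstream out("output.txt");            // placeholder output file
        std::string line;

        while (std::getline(in, line)) {
            fs::path bucket = scratch / std::to_string(std::hash<std::string>{}(line));

            // Different lines can share a hash, so compare the actual text
            // stored in the bucket file before declaring a duplicate.
            bool seen = false;
            std::ifstream check(bucket);
            for (std::string prev; std::getline(check, prev); ) {
                if (prev == line) { seen = true; break; }
            }

            if (!seen) {
                std::ofstream(bucket, std::ios::app) << line << '\n';
                out << line << '\n';
            }
        }
    }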

As an addition to what others have said below: if you expect a high duplicate rate, you could maintain some threshold of the hashes in memory as well as on disk, which would give much better results when duplicates are common. Since the file is so large, I doubt O(n^2) processing time is acceptable. My solution is O(n) in processing time and O(1) in memory. It is O(n) in disk space used during runtime, however, which the other solutions do not require.

It sounds like you might be running on limited hardware with varied specs, so you'll want to test a number of duplicate-removal algorithms and profile them before deciding which is best for long-term use.

Stefan Kendall
If your hashing algorithm is fast and dumb (high collision rate), run KMP or some other linear-time string-matching algorithm when comparing the candidate lines.
Stefan Kendall
+3  A: 

I would hash each line and then seek back to the lines that have non-unique hashes and compare them individually (or in a buffered fashion). This would work well on files with a relatively low occurrence of duplicates.

When you use a hash, you can cap the memory used at a constant amount (i.e., you could have a tiny hash table with just 256 slots, or something larger; in any case, the table can be restricted to any constant size). The values in the table are the offsets of the lines with that hash, so you only need line_count*sizeof(int) plus a constant to maintain the hash table.

Even simpler (but much slower) would be to scan the entire file for each line, but I prefer the first option. The full scan is the most memory-efficient option possible: you would only need to store two offsets and two bytes to do the comparison.
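
A rough sketch of the hash-then-seek-back idea above, assuming std::hash over a fixed number of buckets; the file names are placeholders, and original line order is preserved by flagging duplicate offsets and streaming the file one more time:

    // Sketch: pass 1 buckets the starting offset of every line by hash (the
    // table itself is a fixed size; what grows is one offset per line). Pass 2
    // seeks back and compares the actual text of lines that share a bucket,
    // flagging the offsets of true duplicates. Pass 3 streams the file again,
    // skipping the flagged offsets. File names are placeholders.
    #include <algorithm>
    #include <cstdint>
    #include <fstream>
    #include <functional>
    #include <string>
    #include <vector>

    int main() {
        constexpr std::size_t kBuckets = 1 << 16;                // fixed table size
        std::vector<std::vector<std::int64_t>> table(kBuckets);  // offsets per bucket

        std::ifstream in("input.txt", std::ios::binary);         // placeholder input
        std::string line;

        // Pass 1: record where each line starts, bucketed by its hash.
        for (std::int64_t off = in.tellg(); std::getline(in, line); off = in.tellg())
            table[std::hash<std::string>{}(line) % kBuckets].push_back(off);

        // Re-read the single line starting at a given offset.
        auto read_at = [&in](std::int64_t off) {
            in.clear();
            in.seekg(off);
            std::string s;
            std::getline(in, s);
            return s;
        };

        // Pass 2: within each bucket, compare lines pairwise by seeking back.
        std::vector<std::int64_t> dup_offsets;
        for (const auto& bucket : table) {
            for (std::size_t i = 1; i < bucket.size(); ++i) {
                const std::string cur = read_at(bucket[i]);
                for (std::size_t j = 0; j < i; ++j) {
                    if (cur == read_at(bucket[j])) {
                        dup_offsets.push_back(bucket[i]);
                        break;
                    }
                }
            }
        }
        std::sort(dup_offsets.begin(), dup_offsets.end());

        // Pass 3: copy every line whose offset was not flagged as a duplicate.
        std::ofstream out("output.txt");                         // placeholder output
        in.clear();
        in.seekg(0);
        std::size_t next = 0;
        for (std::int64_t off = in.tellg(); std::getline(in, line); off = in.tellg()) {
            if (next < dup_offsets.size() && dup_offsets[next] == off) { ++next; continue; }
            out << line << '\n';
        }
    }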

A: 

You could use an I/O-efficient sort (like the Unix sort command) and then read the file back in line by line, comparing each line to the one previously read. If the two are equal, don't output anything; if they aren't, output the line.

This way the amount of memory used by the algorithm is constant.
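
A minimal sketch of the comparison pass, assuming the file has already been sorted externally (e.g. with the Unix sort command); the file names are placeholders:

    // Sketch: once the file is sorted (e.g. `sort input.txt > sorted.txt`),
    // duplicates are adjacent, so one streaming pass that remembers only the
    // previous line is enough. Memory use is bounded by the longest line.
    // File names are placeholders.
    #include <fstream>
    #include <string>

    int main() {
        std::ifstream in("sorted.txt");   // placeholder: the pre-sorted file
        std::ofstream out("output.txt");  // placeholder output

        std::string line, prev;
        bool first = true;
        while (std::getline(in, line)) {
            if (first || line != prev) {  // emit only when the line changes
                out << line << '\n';
                prev = line;
                first = false;
            }
        }
    }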

Anatoly Fayngelerin
You can't treat the Unix command as being outside of the algorithm; it's part of the process and so has to be counted. Given that it has to store the sort order of the items, it would have to use linear memory.
santosc
I suggested using something LIKE the Unix sort command. In fact, an I/O-efficient sort can be implemented to run with constant memory usage: break the file up into chunks of size c, read in each chunk and sort it, output the results to temp files, then merge the sorted chunks using a read buffer of size c (a sketch follows this comment thread). For more information on external (I/O-efficient) sorting, see http://en.wikipedia.org/wiki/External_sorting, as well as the Knuth book.
Anatoly Fayngelerin
I was under the impression that memory of any kind (external or RAM) could be used. If he can use disk space, then, absolutely, your approach is viable.
santosc
Given that the disk space use implied by the problem statement is already linear in the size of the input (e.g. output size = input size in the worst case), using an external merge sort would not be a huge change in disk space use.
Anatoly Fayngelerin
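
A rough sketch of the chunked external merge sort described in the comments above, assuming the chunk is counted in lines rather than bytes; the chunk size and all file names are placeholders, and removing duplicates is still a separate pass over the sorted output:

    // Sketch of an external merge sort: read runs of at most kChunkLines lines,
    // sort each run in memory, spill it to a temp file, then k-way merge the
    // runs with a priority queue holding one line per run. The chunk size and
    // all file names are placeholders; removing duplicates is a separate pass
    // over "sorted.txt" afterwards.
    #include <algorithm>
    #include <fstream>
    #include <memory>
    #include <queue>
    #include <string>
    #include <utility>
    #include <vector>

    int main() {
        constexpr std::size_t kChunkLines = 100000;  // "c": whatever fits in RAM

        // Phase 1: split the input into sorted runs on disk.
        std::ifstream in("input.txt");               // placeholder input
        std::vector<std::string> run_names;
        std::vector<std::string> chunk;
        std::string line;
        auto spill = [&]() {
            if (chunk.empty()) return;
            std::sort(chunk.begin(), chunk.end());
            std::string name = "run_" + std::to_string(run_names.size()) + ".tmp";
            std::ofstream run(name);
            for (const auto& s : chunk) run << s << '\n';
            run_names.push_back(name);
            chunk.clear();
        };
        while (std::getline(in, line)) {
            chunk.push_back(line);
            if (chunk.size() >= kChunkLines) spill();
        }
        spill();

        // Phase 2: k-way merge of the sorted runs (smallest line on top).
        using Entry = std::pair<std::string, std::size_t>;  // (line, run index)
        auto cmp = [](const Entry& a, const Entry& b) { return a.first > b.first; };
        std::priority_queue<Entry, std::vector<Entry>, decltype(cmp)> heap(cmp);

        std::vector<std::unique_ptr<std::ifstream>> runs;
        for (std::size_t i = 0; i < run_names.size(); ++i) {
            runs.push_back(std::make_unique<std::ifstream>(run_names[i]));
            if (std::getline(*runs[i], line)) heap.emplace(line, i);
        }

        std::ofstream out("sorted.txt");             // placeholder sorted output
        while (!heap.empty()) {
            auto [s, i] = heap.top();
            heap.pop();
            out << s << '\n';
            if (std::getline(*runs[i], line)) heap.emplace(line, i);
        }
    }
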
A: 

Simple brute-force solution (very little memory consumption): do an O(n^2) pass through the file and remove duplicate lines. Speed: O(n^2), memory: constant.

Fast (but poor memory consumption): Stefan Kendall's solution: hash each line, store the hashes in a map of some sort, and remove a line whose hash already exists. Speed: O(n), memory: O(n).

If you're willing to sacrifice file order (I assume not, but I'll add it): you can sort the lines, then pass through removing duplicates. Speed: O(n*log(n)), memory: constant.

edit: If you don't like the idea of sorting the file contents or trying to maintain unique hashes but can handle O(n) memory usage: you can identify each line by its 32-bit or 64-bit position marker (depending on the file's size) and sort the file positions instead of the file contents (see the sketch after this answer).

edit #2: caveat: sorting lines of different lengths in memory is harder than doing it to, say, an array of ints... actually, thinking about how the memory would have to shift and move in a merge step, I'm second-guessing my ability to sort a file like that in n*log(n).
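
A rough sketch of the position-sorting edit above, assuming 64-bit offsets and a comparator that re-reads both lines from disk on every comparison; the file names are placeholders, and the output comes out in sorted order rather than the original order:

    // Sketch of the position-sorting idea: keep one 64-bit offset per line
    // (O(n) memory in offsets, not in text), sort the offsets with a comparator
    // that re-reads the two lines from disk, then emit each distinct line once.
    // File names are placeholders; the output ends up in sorted order.
    #include <algorithm>
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    int main() {
        std::ifstream in("input.txt", std::ios::binary);  // placeholder input
        std::string line;

        // Collect the starting offset of every line.
        std::vector<std::int64_t> offsets;
        for (std::int64_t off = in.tellg(); std::getline(in, line); off = in.tellg())
            offsets.push_back(off);

        // Re-read the single line starting at a given offset.
        auto read_at = [&in](std::int64_t off) {
            in.clear();
            in.seekg(off);
            std::string s;
            std::getline(in, s);
            return s;
        };

        // Sort positions by the text they point at; every comparison hits disk.
        std::sort(offsets.begin(), offsets.end(),
                  [&](std::int64_t a, std::int64_t b) { return read_at(a) < read_at(b); });

        // Emit each distinct line once (duplicates are now adjacent).
        std::ofstream out("output.txt");                  // placeholder output
        std::string prev;
        bool first = true;
        for (std::int64_t off : offsets) {
            std::string cur = read_at(off);
            if (first || cur != prev) {
                out << cur << '\n';
                prev = cur;
                first = false;
            }
        }
    }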

santosc
It is unclear from your answer how sorting a file of size N can be done using a constant amount of memory.
Anatoly Fayngelerin
+1  A: 

Why not just consult Knuth, Sorting and Searching? That will give you a great background for making a balanced decision.

Permaquid