I have two very large files, neither of which would fit in memory. Each file has one string per line; the strings contain no spaces and are either 99, 100, or 101 characters long.

Update: The strings are not in any sorted order.
Update2: I am working with Java on Windows.

Now I want to figure out the best way to find all the strings that occur in both files.

I have been thinking about using an external merge sort to sort both files and then comparing them, but I am not sure whether that would be the best approach. Since the strings are mostly around the same length, I have also been wondering whether computing some kind of hash for each string would be a good idea, since that should make comparing strings easier; but then I would have to keep the hashes computed for the strings encountered so far, so that they can be compared against strings read later. I cannot pin down what exactly would be the best way, so I am looking for your suggestions.

When you suggest a solution, please also state whether it would work if there were more than two files and the strings occurring in all of them had to be found.

A: 

Is there any order to the data in the files? The reason I ask is that although a line-by-line comparison would take an eternity, going through one file line by line while doing a binary search in the other would be much quicker. This only works if the data is sorted in a particular way, though.

Chris Simpson
A: 

I would load both files into two database tables, so that each string in a file becomes a row in the corresponding table, and then use a SQL join to find the rows that appear in both.
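
As a rough sketch of this approach in Java, assuming an embedded SQLite database reached over JDBC (the driver, JDBC URL, table names, and file names below are my own placeholders, not anything from the answer):

import java.io.BufferedReader;
import java.io.FileReader;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class DbIntersect {
    public static void main(String[] args) throws Exception {
        // Assumes the SQLite JDBC driver is on the classpath; names are placeholders.
        try (Connection conn = DriverManager.getConnection("jdbc:sqlite:strings.db")) {
            load(conn, "file1.txt", "file1");
            load(conn, "file2.txt", "file2");
            try (Statement st = conn.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT DISTINCT a.s FROM file1 a JOIN file2 b ON a.s = b.s")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1));
                }
            }
        }
    }

    // Loads one file into one single-column table, batching inserts to bound memory.
    private static void load(Connection conn, String path, String table) throws Exception {
        try (Statement st = conn.createStatement()) {
            st.execute("CREATE TABLE " + table + " (s TEXT)");
        }
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO " + table + " VALUES (?)");
             BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            int batched = 0;
            while ((line = in.readLine()) != null) {
                ps.setString(1, line);
                ps.addBatch();
                if (++batched % 10_000 == 0) {
                    ps.executeBatch();   // flush periodically
                }
            }
            ps.executeBatch();
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE INDEX idx_" + table + " ON " + table + " (s)");  // speeds up the join
            }
            conn.commit();
        }
        conn.setAutoCommit(true);
    }
}

With more than two files, load each one into its own table and either chain the joins or group by the string and keep those whose count equals the number of files.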

Jamie Ide
+13  A: 

You haven't said what platform you're working on, so I assume you're working on Windows, but in the unlikely event that you're on a Unix platform, standard tools will do it for you.

sort file1 | uniq > output
sort file2 | uniq >> output
sort file3 | uniq >> output
...
sort output | uniq -d
Leonard
And in the event that you are on a Windows platform, the simplicity of this solution is so great that it's probably worth finding a Unix box or installing Cygwin. This is also how I would solve this.
BigDave
This doesn't tell you which strings occur in all of the files, only which strings occur in at least two of them.
Seb
uniq -d suppresses lines that occur only once and prints a single copy of each duplicated line.
Christian Witts
+1 for Cygwin, and your elegant solution.
elo80ka
This really is the simplest solution - even if you have to install Cygwin on Windows (which is relatively painless). It will save you so much time compared to rolling your own.
Galghamon
Thank you for this one. But I have to handle this using Java on Windows :(
Skylark
Runtime here is basically O(n log n), where n is the total number of lines across all the files.
Gregg Lind
If you want to know how many files each line is common to, then the last line can be modified to: sort output | uniq -c
Leonard
+1  A: 

I'd do it as follows (for any number of files):

  • Sort just one file (#1).
  • Walk through each line of the next file (#2) and do a binary search on file #1 by line position (see the sketch after this list).
  • If you find the string, write it to a temporary file (#temp1).
  • After you have finished with #2, sort #temp1, then go to #3 and run the same search against #temp1 instead of #1; this pass should take much less time, since #temp1 contains only the lines matched so far.
  • Repeat the process with new temporary files, deleting the previous #temp files. Each iteration should take less time, as the number of repeated lines diminishes.
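
A rough Java sketch of the binary-search step, assuming file #1 has already been sorted externally and that an in-memory index of line start offsets (one long per line) is acceptable; the class and file handling here are my own illustration:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class SortedFileSearch {
    private final RandomAccessFile file;
    private final List<Long> offsets = new ArrayList<>();   // byte offset of each line

    // One pass over the sorted file to record where each line starts.
    // For truly huge files, store these offsets in a long[] or in a side file instead.
    public SortedFileSearch(String path) throws IOException {
        file = new RandomAccessFile(path, "r");
        long pos = 0;
        while (file.readLine() != null) {
            offsets.add(pos);
            pos = file.getFilePointer();
        }
    }

    // Binary search over line positions; only one line is held in memory at a time.
    public boolean contains(String key) throws IOException {
        int lo = 0, hi = offsets.size() - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            file.seek(offsets.get(mid));
            int cmp = file.readLine().compareTo(key);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }
}

Walking file #2 then just means reading it with a BufferedReader and appending every line for which contains(...) returns true to #temp1. Note that RandomAccessFile.readLine() assumes a single-byte encoding, which is fine for the ASCII strings described in the question, but it is slow; a production version would want buffering.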
Seb
A: 

I would sort each file, then use a Balanced Line algorithm, reading one line at a time from one file or the other.
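
For two files this balanced-line pass is just a merge over the two sorted streams. A minimal sketch in Java, assuming both files have already been sorted (and deduplicated) externally; the file names are placeholders:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class BalancedLine {
    // Prints lines common to two files that have already been externally sorted.
    public static void main(String[] args) throws IOException {
        try (BufferedReader a = new BufferedReader(new FileReader("sorted1.txt"));
             BufferedReader b = new BufferedReader(new FileReader("sorted2.txt"))) {
            String la = a.readLine(), lb = b.readLine();
            while (la != null && lb != null) {
                int cmp = la.compareTo(lb);
                if (cmp == 0) {            // match: emit and advance both
                    System.out.println(la);
                    la = a.readLine();
                    lb = b.readLine();
                } else if (cmp < 0) {      // a is behind: advance a
                    la = a.readLine();
                } else {                   // b is behind: advance b
                    lb = b.readLine();
                }
            }
        }
    }
}

If the files may contain internal duplicates, either deduplicate during the sort or skip repeated lines after each match, so that each common string is printed only once.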

mbeckish
A: 

A hash-based solution might look like this (in Python):

import hashlib
from collections import defaultdict

hashes = defaultdict(int)                 # md5 digest -> total occurrence count
for name in files:                        # 'files' is the list of input file names
    with open(name) as f:
        for line in f:
            hashes[hashlib.md5(line.encode()).digest()] += 1

Then loop over the files again, printing the matching lines:

nfiles = len(files)
for name in files:
    with open(name) as f:
        for line in f:
            h = hashlib.md5(line.encode()).digest()
            if hashes.get(h) == nfiles:   # assumes no string repeats within a single file
                print(line, end='')
                del hashes[h]             # so we only print each common line once

There are two potential problems.

  1. Potential hash collisions (the risk can be mitigated somewhat, but never removed entirely).
  2. The dict (associative array) has to hold one entry per unique line across all the files, so it must fit in memory.

This is O(lines * cost(md5)).

(If anyone wants a fuller Python implementation, it's pretty easy to write; I don't know Java, though!)
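
Since the question mentions Java, here is a minimal sketch of the same two-pass idea in Java, with the same caveats about collisions, intra-file duplicates, and memory (taking the file names from the command line is my own choice for illustration):

import java.io.BufferedReader;
import java.io.FileReader;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class HashCount {
    public static void main(String[] args) throws Exception {
        MessageDigest md = MessageDigest.getInstance("MD5");
        Map<String, Integer> counts = new HashMap<>();

        // Pass 1: count occurrences of each line's digest across all files.
        for (String name : args) {
            try (BufferedReader in = new BufferedReader(new FileReader(name))) {
                String line;
                while ((line = in.readLine()) != null) {
                    counts.merge(digest(md, line), 1, Integer::sum);
                }
            }
        }

        // Pass 2: re-read the files and print lines whose digest was seen once per file.
        for (String name : args) {
            try (BufferedReader in = new BufferedReader(new FileReader(name))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String h = digest(md, line);
                    Integer c = counts.get(h);
                    if (c != null && c == args.length) {
                        System.out.println(line);
                        counts.remove(h);           // print each common line only once
                    }
                }
            }
        }
    }

    private static String digest(MessageDigest md, String line) {
        return Base64.getEncoder().encodeToString(
                md.digest(line.getBytes(StandardCharsets.UTF_8)));
    }
}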

Gregg Lind
+1  A: 

Depending on how similar the entries within one file are, it might be possible to build a trie (not a tree) from it. Using this trie you can iterate over the other file and check, for each entry, whether it is in the trie.

When you have more than two files, iterate over the next file and build a new trie from the matches. That way the last trie you end up with contains exactly the strings present in all the files.
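
A minimal sketch of such a trie in Java (the class is my own illustration; whether the first file's trie actually fits in memory depends on how much the strings share prefixes):

import java.util.HashMap;
import java.util.Map;

public class Trie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal;                 // true if a string ends at this node
    }

    private final Node root = new Node();

    public void insert(String s) {
        Node n = root;
        for (int i = 0; i < s.length(); i++) {
            n = n.children.computeIfAbsent(s.charAt(i), c -> new Node());
        }
        n.terminal = true;
    }

    public boolean contains(String s) {
        Node n = root;
        for (int i = 0; i < s.length() && n != null; i++) {
            n = n.children.get(s.charAt(i));
        }
        return n != null && n.terminal;
    }
}

Build the trie from one file, then stream the next file line by line and keep the lines for which contains(...) returns true; with more files, insert only those matches into a fresh trie and repeat, as described above.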

martinus
A: 

To do it on Windows is pretty simple. Let's say you have two files, A and B, where file A contains the strings you want to search for in file B. Just open a command prompt and use the following command:

FINDSTR /G:A B > OUTPUT

This command is pretty fast and can compare two files very efficiently. The file OUTPUT will contain the strings common to A and B. (Note that FINDSTR matches substrings by default; add /X, and /L for literal matching, if you need exact whole-line matches.)

If you want the opposite (strings in B that are not in A), then use:

FINDSTR /V /G:A B > OUTPUT

Hope it helps.

muzammil butt