ansaurus

Question

How to find common strings among two very large files?

Answer 1

A:

Is there any order to the data in the files? The reason I ask is that though a line by line comparison would take an eternity, going through one file line by line whilst doing a binary search in the other would be much quicker. This can only work if the data is sorted in a particular way though.
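A minimal sketch of the binary-search idea, assuming the sorted file fits in memory as a list of lines. The function name and sample data are made up for illustration; a real version would read the lines from the two files.

```python
from bisect import bisect_left

def common_by_binary_search(sorted_lines, probe_lines):
    """Return the probe lines that also occur in sorted_lines.

    sorted_lines must be a list already sorted in ascending order.
    """
    found = set()
    for line in probe_lines:
        # Locate the leftmost position where line could be inserted.
        i = bisect_left(sorted_lines, line)
        if i < len(sorted_lines) and sorted_lines[i] == line:
            found.add(line)
    return found

# Hypothetical usage with two small in-memory "files".
file_a = sorted(["pear", "apple", "kiwi", "fig"])
file_b = ["kiwi", "plum", "apple", "kiwi"]
print(sorted(common_by_binary_search(file_a, file_b)))  # ['apple', 'kiwi']
```

Each lookup costs O(log n) comparisons, so only one of the two files has to be sorted.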

Chris Simpson 2009-03-18 14:05:42

Answer 2

A:

I would load both files into two database tables so that each string in the file became a row in the table and use SQL queries to find duplicate rows using a join.
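As a rough illustration of the database approach, here is a sketch using Python's built-in sqlite3 module. The table names (a, b), the column name (s), and the function name are assumptions made for the example; any SQL database would work along the same lines.

```python
import sqlite3

def common_strings_sql(lines_a, lines_b):
    """Load two line collections into tables and join them to find shared strings."""
    conn = sqlite3.connect(":memory:")  # pass a file path instead for large data
    conn.execute("CREATE TABLE a (s TEXT)")
    conn.execute("CREATE TABLE b (s TEXT)")
    conn.executemany("INSERT INTO a VALUES (?)", ((s,) for s in lines_a))
    conn.executemany("INSERT INTO b VALUES (?)", ((s,) for s in lines_b))
    # An index on one side can let the join look rows up instead of scanning.
    conn.execute("CREATE INDEX idx_b_s ON b (s)")
    rows = conn.execute(
        "SELECT DISTINCT a.s FROM a JOIN b ON a.s = b.s ORDER BY a.s"
    ).fetchall()
    conn.close()
    return [row[0] for row in rows]

print(common_strings_sql(["x", "y", "z", "y"], ["y", "w", "x"]))  # ['x', 'y']
```

DISTINCT is there so that a string repeated within either file is reported only once.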

Jamie Ide 2009-03-18 14:14:27

Answer 3

+13 A:

You haven't said what platform you're working on, so I assume you're working on Windows, but in the unlikely event that you're on a Unix platform, standard tools will do it for you.

sort file1 | uniq > output
sort file2 | uniq >> output
sort file3 | uniq >> output
...
sort output | uniq -d

Leonard 2009-03-18 14:14:54

And in the event that you are on a Windows platform, the simplicity of this solution is so great that it's probably worth finding a Unix box, or installing cygwin. This is also how I would solve this.

BigDave 2009-03-18 14:19:10

This doesn't tell you which strings are repeated in all files; it outputs the set union of all files.

Seb 2009-03-18 14:19:20

uniq -d deletes singly occurring lines and only prints a single copy of duplicated lines.

Christian Witts 2009-03-18 14:34:35

+1 for cygwin, and your elegant solution.

elo80ka 2009-03-18 14:35:32

This really is the simplest solution - even if you have to install cygwin on windows (which is relatively painless). It will save you so much time compared to rolling your own.

Galghamon 2009-03-18 14:54:53

Thank you for this one. But I have to handle this using Java in Windows :(

Skylark 2009-03-18 15:19:48

Runtime here is basically O(n log n), where n is the number of lines in the longest file.

Gregg Lind 2009-03-18 15:21:21

If you want to know how many files each line is common to, then the last line can be modified to: sort output | uniq -c

Leonard 2009-03-18 15:23:27

Answer 4

+1 A:

I'd do it as follows (for any number of files):

Sort just one file (#1).
Walk through each line of the next file (#2) and do a binary search on the #1 file (based on the number of lines).
If you find the string, write it to another temp file (#temp1).
After you finish with #2, sort #temp1, then go to #3 and do the same search, but this time against #temp1 rather than #1. This should take much less time than the first pass, since #temp1 contains only the repeated lines.
Repeat this process with new temporary files, deleting the previous #temp files. Each iteration should take less and less time, as the number of repeated lines diminishes.
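The steps above might be sketched as follows. This is an in-memory approximation, not the on-disk version described: each "file" is assumed to be a list of lines, and a sorted list stands in for the sorted #1 and #temp files. The names contains and common_to_all are hypothetical.

```python
from bisect import bisect_left

def contains(sorted_lines, line):
    """Binary search for line in an ascending sorted list."""
    i = bisect_left(sorted_lines, line)
    return i < len(sorted_lines) and sorted_lines[i] == line

def common_to_all(files):
    """files: a non-empty list of line lists (stand-ins for the real files)."""
    candidates = sorted(set(files[0]))      # sort file #1
    for other in files[1:]:
        # Keep only the lines of this file found among the current candidates;
        # this plays the role of the #temp file in the steps above.
        matches = {line for line in other if contains(candidates, line)}
        candidates = sorted(matches)        # sort the temp data for the next round
        if not candidates:
            break                           # nothing can be common any more
    return candidates

print(common_to_all([["b", "a", "c"], ["c", "b", "x"], ["b", "y"]]))  # ['b']
```

The candidate list can only shrink from one file to the next, which is why later rounds get cheaper.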

Seb 2009-03-18 14:33:02

Answer 5

A:

I would sort each file, then use a Balanced Line algorithm, reading one line at a time from one file or the other.
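A sketch of that merge-style pass, assuming both inputs are already sorted in the same order Python uses for comparisons. The function name is made up; the inputs can be any iterables of lines, including open file objects.

```python
def common_sorted(lines_a, lines_b):
    """Yield lines present in both inputs; both must be sorted ascending."""
    it_a, it_b = iter(lines_a), iter(lines_b)
    a, b = next(it_a, None), next(it_b, None)
    while a is not None and b is not None:
        if a < b:
            a = next(it_a, None)      # a is behind: advance the first input
        elif b < a:
            b = next(it_b, None)      # b is behind: advance the second input
        else:
            yield a                   # equal: a common line
            common = a
            # Skip repeats of the matched line on both sides.
            while a == common:
                a = next(it_a, None)
            while b == common:
                b = next(it_b, None)

print(list(common_sorted(["a", "b", "b", "d"], ["b", "b", "c", "d"])))  # ['b', 'd']
```

Only one line from each input is held in memory at a time, so the cost after sorting is a single pass over both files. If the files were sorted by an external tool, its collation order has to agree with Python's string comparison (for GNU sort, running it with LC_ALL=C gives plain byte order).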

mbeckish 2009-03-18 14:43:17

Answer 6

A:

A hash based solution might look like this (in python pseudocode):

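import hashlib
from collections import defaultdict

hashes = defaultdict(int)  # md5 digest -> number of files containing that line
for path in files:  # files: a list of file paths
    with open(path, "rb") as f:
        # Use a set so a line repeated within one file is counted only once.
        for h in {hashlib.md5(line.rstrip(b"\r\n")).digest() for line in f}:
            hashes[h] += 1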

Then loop over again, printing matching lines:

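import hashlib

nfiles = len(files)
for path in files:
    with open(path, "rb") as f:
        for line in f:
            h = hashlib.md5(line.rstrip(b"\r\n")).digest()
            if hashes.get(h) == nfiles:
                print(line.decode(errors="replace").rstrip("\r\n"))
                del hashes[h]  # since we only want each once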

There are two potential problems.

Potential hash collisions (the risk can be mitigated somewhat, but not eliminated).
It needs to be able to hold a dict (associative array) of size |unique lines across all files|.

This is O(lines * cost(md5) ).

(If people want a fuller Python implementation, it's pretty easy to write. I don't know Java, though!)

Gregg Lind 2009-03-18 15:36:37

Answer 7

+1 A:

Depending on how similar the entries within one file are, it might be possible to create a trie (not a tree) from it. Using this trie, you can iterate over the other file and check whether each entry is in the trie.

When you have more than 2 files, iterate over one file and build a new trie from the matches. This way the last trie you have will contain all the matches that are contained in all files.
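A small sketch of the trie approach, assuming the entries are plain strings and that a nested-dict trie fits in memory. The class and function names are invented for the example.

```python
class Trie:
    """A minimal character trie: nested dicts, with END marking a complete entry."""
    END = object()

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[Trie.END] = True

    def __contains__(self, word):
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        return Trie.END in node   # a mere prefix of an entry does not count

def common_via_trie(files):
    """files: a non-empty list of line lists; returns a trie of the common entries."""
    trie = Trie(files[0])
    for other in files[1:]:
        # Build a new trie holding only the entries that matched so far.
        trie = Trie(line for line in other if line in trie)
    return trie

result = common_via_trie([["car", "cart", "dog"], ["cart", "dog", "do"], ["dog", "ca"]])
print("dog" in result, "cart" in result)  # True False
```

Entries that share prefixes share nodes, which is where the memory saving comes from when the entries are similar.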

martinus 2009-03-20 13:08:17

Answer 8

A:

To do it on Windows, it's pretty simple. Let's say you have two files, A and B. File A contains the strings you want to search for in file B. Just open a command prompt and use the following command:

FINDSTR /G:A B > OUTPUT

This command is pretty fast and can compare two files very efficiently. The file OUTPUT will contain the strings common to A and B.

If you want the inverse (strings in B that are not in A), then use:

FINDSTR /V /G:A B > OUTPUT
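For comparison, here is a rough Python sketch that produces both results in one pass over B. It assumes whole-line equality is what is wanted, which may differ from how FINDSTR matches its search strings; the function name is hypothetical.

```python
def split_by_membership(lines_a, lines_b):
    """Split the lines of B into those that occur in A and those that do not."""
    wanted = set(lines_a)
    common, only_b = [], []
    for line in lines_b:
        # Append to whichever list applies, preserving the order of B.
        (common if line in wanted else only_b).append(line)
    return common, only_b

common, only_b = split_by_membership(["x", "y"], ["y", "z", "x", "q"])
print(common)  # ['y', 'x']
print(only_b)  # ['z', 'q']
```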

hope it helps

muzammil butt 2009-11-08 12:58:31
