views:

45

answers:

3

Hi,

I have two files of data, 100 char lines each. File A: 10^8 lines, file B: 10^6 lines. And I need to find all the strings from file B that are not in file A.
At first I was thinking feeding both files to mysql, but it looks like it won't ever finish creating an unique key on 10^8 records.

I'm waiting for your suggestions on this.

A: 

I recommend mongodb

sirmak
Well, mongodb is not that impressively faster that mysql out of the box.
nweb
+2  A: 

You can perform this operation without a database. The key is to reduce the size of A, since A is much larger than B. Here is how to do this:

Calculate 64-bit hashes using a decent hash function for the strings in the B file. Store these in memory (in a hash table), which you can do because B is small. Then hash all of the strings in your A file, line by line, and see if each one matches a hash for your B file. Any lines with matching hashes (to one from B), should be stored in a file C.

When this process is complete file C will have the small subset of A of potentially matching strings (to B). Now you have a much smaller file C that you need to compare lines of B with. This reduces the problem to a problem where you can actually load all of the lines from C into memory (as a hash table) and compare each line of B to see if it is in C.

Michael Goldshteyn
A: 

For the sizes you mention you should be able to keep all of B in memory at once, so you could do a simplified version of Goldshteyn's answer; something like this in python:

#!/usr/bin/python3

import sys

if __name__=='__main__':
  b = open(sys.argv[2],'r')
  bs = set()
  for l in b:
    bs.add(l.strip())
  b.close()
  a = open(sys.argv[1],'r')
  for l in a:
    l = l.strip()
    if l in bs:
      bs.remove(l)
  for x in bs:
    print(x)

I've tested this on two files of 10^5 and 10^7 in size with ~8 chars per line on an atom processor. Output from /usr/bin/time:

25.15user 0.27system 0:25.80elapsed 98%CPU (0avgtext+0avgdata 56032maxresident)k
0inputs+0outputs (0major+3862minor)pagefaults 0swaps
  60298   60298  509244
Kevin Stock