I have 100 million lines of data; each line is a single word no longer than 15 characters, one word per line. The data is stored in multiple files.

My goal is to find the unique words among all the files.

One solution is to import all the words into a database and add a unique key on the field, but that is too slow for a data set this large.

Is there any faster solution?

Thank you

A: 

If you have to stick with the file structure, then you need some way of indexing the files and then maintaining the index.

Otherwise, I would recommend moving to a database and migrating all operations on those files to work with the database.

casperOne
+1  A: 

I'm not sure that there'll be many faster ways than using a database. Personally, I usually use a UNIX shell pipeline for this:

cat * | sort | uniq

I don't know how fast that would be with 100,000,000 words, and I'm not sure how fast you want it to be. (E.g., do you need to run it lots of times or just once? If just once, I'd go with the sort and uniq option and let it run overnight if you can).

Alternatively, you could write a script in Ruby or a similar language that stores the words in an associative array. I suspect that would almost certainly be slower than the database approach, though.

I guess if you really want speed, and you need to carry out this task (or ones like it) often, then you might want to write something in C, but to me that feels a bit like overkill.

Ben
If you decide to use the sort command, consider sorting the files individually first with sort -u (for unique) and then merging them with sort -um (which removes duplicates across files). This helps when the files are large. Also, ls -fA and xargs -I can help make this easier to script.
Joseph Bui
A: 

You could store the words in a hash table. Assuming there are quite a number of duplicates, the O(1) search time will be a big performance boost.

  1. Read a line.
  2. Search for the word in the hash table.
  3. If not found, add it to the table.
  4. When every file has been processed, the table's keys are your unique words (sketched below).
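
A minimal Python sketch of that approach, using a built-in set as the hash table (the words_*.txt pattern and the output file name are placeholders):

# Collect the unique words with an in-memory hash table (a Python set).
import glob

unique_words = set()

for path in glob.glob("words_*.txt"):    # placeholder input file pattern
    with open(path) as f:
        for line in f:
            word = line.strip()
            if word:                     # skip blank lines
                unique_words.add(word)   # O(1) average lookup-and-insert

with open("unique_words.txt", "w") as out:
    out.write("\n".join(sorted(unique_words)))

print(len(unique_words), "unique words")

With heavy duplication the set stays far smaller than the 100 million input lines, so it should fit comfortably in memory.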
lc
A: 

If you have this much data, then it needs to be in a SQL server. This is why SQL was designed in the first place. If you continue to use these files you will forever be stuck with performance issues.

Even if these files are modified by external programs (or arrive via FTP), you need to create an import process that runs nightly.
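
If you go that route, here is a rough sketch of the import-with-a-unique-key idea, using SQLite purely for illustration rather than a full SQL server (file and table names are placeholders); INSERT OR IGNORE silently drops the duplicate words:

# Import every word into a table with a unique key; duplicates are ignored.
import glob
import sqlite3

conn = sqlite3.connect("words.db")       # placeholder database file
conn.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)")

for path in glob.glob("words_*.txt"):    # placeholder input files
    with open(path) as f:
        conn.executemany(
            "INSERT OR IGNORE INTO words (word) VALUES (?)",
            ((line.strip(),) for line in f if line.strip()),
        )
conn.commit()

(count,) = conn.execute("SELECT COUNT(*) FROM words").fetchone()
print(count, "unique words")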

TravisO
Wrong. SQL is optimised for easily reading tables and relations; there's nothing like that in this problem. If this processing is recurrent, it's best to use a purpose-built store (probably on top of BDB). If it's one-shot, use sort | uniq.
Javier
Very, very wrong. It fits in RAM on a machine with 4 GB.
Stephan Eggermont
A: 

If there's significant duplication within individual files, it may be quicker to do it file by file and then merge the results. Something along the lines of:

{ for n in * ; do sort -u "$n" ; done ; } | sort -u

(I'm assuming GNU bash and GNU sort)

I think the choice of best solution will depend heavily on the distribution of duplicates and the number of separate files, though, which you haven't shared with us.


Given myhusky's clarification (plenty of dupes, 10~20 files), I'll definitely suggest this as a good solution. In particular, dense duplication will speed up sort -u versus sort | uniq.

Brent.Longborough
Duplicates are heavy; the number of files is around 10~20.
myhusky
Sounds like this one might just run a bit faster than @Ben's, then. Have you tried it? And if dupes are heavy, sort -u will, I think, run a lot faster than sort | uniq.
Brent.Longborough
A: 

You can conserve speed, space, or your sanity. Pick any two.

Throwing it all into a database sacrificed both speed and space, as you found out. But it was easy.

If space is your main problem (memory, disk space), then partition the work. Filter all of the 1-character lines from the files and use one of the above solutions (sort, uniq). Repeat with the 2-character lines for each file, and so on. The unique results from each pass form your solution set.
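
A rough Python sketch of that length-partitioning pass (file names are placeholders); each pass re-reads the files but only holds the distinct words of one length in memory:

# One pass per word length (1..15), so each pass only keeps
# the distinct words of that length in memory.
import glob

files = glob.glob("words_*.txt")              # placeholder input files

with open("unique_words.txt", "w") as out:    # placeholder output file
    for length in range(1, 16):               # words are at most 15 chars
        seen = set()
        for path in files:
            with open(path) as f:
                for line in f:
                    word = line.strip()
                    if len(word) == length:
                        seen.add(word)
        for word in sorted(seen):
            out.write(word + "\n")

This trades extra I/O (one full read of the files per length) for a much smaller working set on each pass.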

If your main problem is speed, then read each file exactly once, creating a hash table (dictionary, whatever) to look for duplicates. Depending on the hash implementation, this could eat up bucketloads of memory (or disk). But it'll be fast.

If you need to conserve speed and space, then consider blending the two techniques. But be prepared to sacrifice the third item.

clintp
+1  A: 

Using a database for this is insane. 100 million records of 15 chars fit in RAM. If there is at least some duplication, simply build a trie. You should be able to process 50 MB/second or so on a modern machine.
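
A toy Python sketch of the trie idea (file names are placeholders); note that a dict-of-dicts trie like this is memory-hungry in Python, so the 50 MB/second figure really assumes a compact implementation in a language like C:

# Build a trie (prefix tree); a word is counted the first time
# its end-of-word marker is set.
import glob

trie = {}              # each node maps a character to a child node
END = "\0"             # marker key for "a word ends here"
unique = 0

for path in glob.glob("words_*.txt"):    # placeholder input files
    with open(path) as f:
        for line in f:
            word = line.strip()
            if not word:
                continue
            node = trie
            for ch in word:
                node = node.setdefault(ch, {})
            if END not in node:          # first time this exact word is seen
                node[END] = True
                unique += 1

print(unique, "unique words")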

Stephan Eggermont