I have 100 million lines of data; each line is a single word no longer than 15 characters, one word per line. The data is stored in multiple files.

My goal is to find the unique words among all the files.

One solution is to import all the words into a database and add a unique key on the field, but that is too slow for a data set this large.

Is there any faster solution?

Thank you

A: 

If you have to stick with the file structure, then you need some way of indexing the files and then maintaining the index.

Otherwise, I would recommend moving to a database and migrating all operations on those files to work with the database.

casperOne
+1  A: 

I'm not sure that there'll be many faster ways than using a database. Personally, I usually use a UNIX shell pipeline for this:

cat * | sort | uniq

I don't know how fast that would be with 100,000,000 words, and I'm not sure how fast you want it to be. (E.g., do you need to run it lots of times or just once? If just once, I'd go with the sort and uniq option and let it run overnight if you can).

Alternatively, you could write a script in Ruby or a similar language that stores the words in an associative array. I suspect that would almost certainly be slower than the database approach, though.

I guess if you really want speed, and you need to carry out this task (or ones like it) often, then you might want to write something in C, but to me that feels a bit like overkill.

Ben
If you decide to use the sort command, consider sorting the files individually first with sort -u (for unique) and then merging them with sort -um (which removes duplicates across files). This helps when the files are large. Also, ls -fA and xargs -I can help make this easier to script.
Joseph Bui
A: 

You could store the words in a hash table. Assuming there are quite a number of duplicates, the O(1) search time will be a big performance boost.

  1. Read a line.
  2. Search for the word in the hash table.
  3. If not found, add it to the table.
  4. When every file has been processed, the table's keys are your unique words (sketched below).
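
A minimal Python sketch of that approach, using a built-in set as the hash table (the words_*.txt pattern and the output file name are placeholders):

# Collect the unique words with an in-memory hash table (a Python set).
import glob

unique_words = set()

for path in glob.glob("words_*.txt"):    # placeholder input file pattern
    with open(path) as f:
        for line in f:
            word = line.strip()
            if word:                     # skip blank lines
                unique_words.add(word)   # O(1) average lookup-and-insert

with open("unique_words.txt", "w") as out:
    out.write("\n".join(sorted(unique_words)))

print(len(unique_words), "unique words")

With heavy duplication the set stays far smaller than the 100 million input lines, so it should fit comfortably in memory.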
lc
A: 

If you have this much data, then it needs to be in a SQL server. This is why SQL was designed in the first place. If you continue to use these files you will forever be stuck with performance issues.

Even if these files are modified by external programs (or arrive via FTP), you need to create an import process that runs nightly.
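
If you go that route, here is a rough sketch of the import-with-a-unique-key idea, using SQLite purely for illustration rather than a full SQL server (file and table names are placeholders); INSERT OR IGNORE silently drops the duplicate words:

# Import every word into a table with a unique key; duplicates are ignored.
import glob
import sqlite3

conn = sqlite3.connect("words.db")       # placeholder database file
conn.execute("CREATE TABLE IF NOT EXISTS words (word TEXT PRIMARY KEY)")

for path in glob.glob("words_*.txt"):    # placeholder input files
    with open(path) as f:
        conn.executemany(
            "INSERT OR IGNORE INTO words (word) VALUES (?)",
            ((line.strip(),) for line in f if line.strip()),
        )
conn.commit()

(count,) = conn.execute("SELECT COUNT(*) FROM words").fetchone()
print(count, "unique words")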

TravisO
Wrong. SQL is optimised for easily reading tables and relations; there's nothing like that in this problem. If this processing is recurrent, it's best to use a purpose-built store (probably on top of BDB). If it's one-shot, use sort | uniq.
Javier
Very, very wrong. It fits in RAM on a machine with 4 GB.
Stephan Eggermont
A: 

If there's significant duplication within individual files, it may be quicker to do it file by file and then merge the results. Something along the lines of:

{ for n in * ; do sort -u "$n" ; done ; } | sort -u

(I'm assuming GNU bash and GNU sort)

I think the choice of best solution will depend heavily on the distribution of duplicates and the number of separate files, though, which you haven't shared with us.


Given myhusky's clarification (plenty of dupes, 10~20 files), I'll definitely suggest this as a good solution. In particular, dense duplication will speed up sort -u versus sort | uniq.

Brent.Longborough
Duplicates are heavy; the number of files is around 10~20.
myhusky
Sounds like this one might just run a bit faster than @Ben's, then. Have you tried it? And if dupes are heavy, sort -u will, I think, run a lot faster than sort | uniq.
Brent.Longborough
A: 

You can conserve speed, space, or your sanity. Pick any two.

Throwing it all into a database sacrificed both speed and space, as you found out. But it was easy.

If space is your main problem (memory, disk space), then partition the work. Filter all of the 1-character lines from the files and use one of the above solutions (sort, uniq). Repeat with the 2-character lines for each file, and so on. The unique results from each pass form your solution set.
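
A rough Python sketch of that length-partitioning pass (file names are placeholders); each pass re-reads the files but only holds the distinct words of one length in memory:

# One pass per word length (1..15), so each pass only keeps
# the distinct words of that length in memory.
import glob

files = glob.glob("words_*.txt")              # placeholder input files

with open("unique_words.txt", "w") as out:    # placeholder output file
    for length in range(1, 16):               # words are at most 15 chars
        seen = set()
        for path in files:
            with open(path) as f:
                for line in f:
                    word = line.strip()
                    if len(word) == length:
                        seen.add(word)
        for word in sorted(seen):
            out.write(word + "\n")

This trades extra I/O (one full read of the files per length) for a much smaller working set on each pass.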

If your main problem is speed, then read each file exactly once, creating a hash table (dictionary, whatever) to look for duplicates. Depending on the hash implementation, this could eat up bucketloads of memory (or disk). But it'll be fast.

If you need to conserve speed and space, then consider blending the two techniques. But be prepared to sacrifice the third item.

clintp
+1  A: 

Using a database for this is insane. 100 million records of 15 chars fit in RAM. If there is at least some duplication, simply build a trie. You should be able to process 50 MB/second or so on a modern machine.
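
A toy Python sketch of the trie idea (file names are placeholders); note that a dict-of-dicts trie like this is memory-hungry in Python, so the 50 MB/second figure really assumes a compact implementation in a language like C:

# Build a trie (prefix tree); a word is counted the first time
# its end-of-word marker is set.
import glob

trie = {}              # each node maps a character to a child node
END = "\0"             # marker key for "a word ends here"
unique = 0

for path in glob.glob("words_*.txt"):    # placeholder input files
    with open(path) as f:
        for line in f:
            word = line.strip()
            if not word:
                continue
            node = trie
            for ch in word:
                node = node.setdefault(ch, {})
            if END not in node:          # first time this exact word is seen
                node[END] = True
                unique += 1

print(unique, "unique words")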

Stephan Eggermont