How would I search through a list of ~5 million 128-bit (or 256-bit, depending on how you look at it) strings quickly and find the duplicates (in Python)? I can turn the strings into numbers, but I don't think that will help much. Since I haven't learned much information theory, is there anything about this in information theory?

And since these are hashes already, there's no point in hashing them again.

+1  A: 

Is this array sorted?

I think the fastest solution would be a heap sort or quick sort, then a single pass through the array to find the duplicates.

Yorirou
Sorting is not the fastest way. See Wai Yip Tung's answer.
gnibbler
+2  A: 

Load them into memory (5M x 64B = 320MB, assuming 64-byte hex strings; raw 16-byte digests would take only 80MB), sort them, and scan through them to find the duplicates.
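A minimal sketch of that approach, assuming the hashes are already loaded into a list named hashes (the name is made up for illustration):

# sort-and-scan duplicate detection (sketch)
hashes.sort()                              # O(n log n); equal values become adjacent
duplicates = []
for i in xrange(1, len(hashes)):
    if hashes[i] == hashes[i - 1]:
        if not duplicates or duplicates[-1] != hashes[i]:
            duplicates.append(hashes[i])   # record each duplicated value only once
print duplicates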

Paul Hankin
+1  A: 

In Python 2.7+ you can use collections.Counter; for older Python use collections.defaultdict(int). Either way is O(n).

First, make a list with some hashes in it:

>>> import hashlib
>>> s=[hashlib.sha1(str(x)).digest() for x in (1,2,3,4,5,1,2)]
>>> s
['5j\x19+y\x13\xb0LTWM\x18\xc2\x8dF\xe69T(\xab', '\xdaK\x927\xba\xcc\xcd\xf1\x9c\x07`\xca\xb7\xae\xc4\xa85\x90\x10\xb0', 'w\xdeh\xda\xec\xd8#\xba\xbb\xb5\x8e\xdb\x1c\x8e\x14\xd7\x10n\x83\xbb', '\x1bdS\x89$s\xa4g\xd0sr\xd4^\xb0Z\xbc 1dz', '\xac4x\xd6\x9a<\x81\xfab\xe6\x0f\\6\x96\x16ZN^j\xc4', '5j\x19+y\x13\xb0LTWM\x18\xc2\x8dF\xe69T(\xab', '\xdaK\x927\xba\xcc\xcd\xf1\x9c\x07`\xca\xb7\xae\xc4\xa85\x90\x10\xb0']

If you are using Python 2.7 or later:

>>> from collections import Counter
>>> c=Counter(s)
>>> duplicates = [k for k in c if c[k]>1]
>>> print duplicates
['\xdaK\x927\xba\xcc\xcd\xf1\x9c\x07`\xca\xb7\xae\xc4\xa85\x90\x10\xb0', '5j\x19+y\x13\xb0LTWM\x18\xc2\x8dF\xe69T(\xab']

If you are using Python 2.6 or earlier:

>>> from collections import defaultdict
>>> d=defaultdict(int)
>>> for i in s:
...  d[i]+=1
... 
>>> duplicates = [k for k in d if d[k]>1]
>>> print duplicates
['\xdaK\x927\xba\xcc\xcd\xf1\x9c\x07`\xca\xb7\xae\xc4\xa85\x90\x10\xb0', '5j\x19+y\x13\xb0LTWM\x18\xc2\x8dF\xe69T(\xab']
gnibbler
+2  A: 

If it fits into memory, use set(). I think it will be faster than sorting: O(n log n) for 5 million items is going to cost you.
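A minimal sketch of the set-based pass, again assuming a list named hashes (a hypothetical name):

# set-based duplicate detection (sketch), O(n) expected time
seen = set()
duplicates = set()
for h in hashes:
    if h in seen:           # already seen once, so it's a duplicate
        duplicates.add(h)
    else:
        seen.add(h)
print list(duplicates)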

If it does not fit into memory, say you have a lot more than 5 million records, divide and conquer: split the records at the midpoint, e.g. 1 x 2^127, and apply any of the above methods to each half. I guess information theory helps by stating that a good hash function will distribute the keys evenly, so the split at the midpoint should work great.

You can also apply divide and conquer even if it fits into memory: sorting 2 x 2.5 million records is faster than sorting 5 million records.
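A rough sketch of the midpoint split, under the assumption that each hash is a raw 16-byte digest (for fixed-length byte strings, lexicographic comparison matches big-endian integer comparison):

MIDPOINT = '\x80' + '\x00' * 15             # 2**127 as a 16-byte big-endian string
low = [h for h in hashes if h < MIDPOINT]   # first half of the key space
high = [h for h in hashes if h >= MIDPOINT]
# find duplicates in low and high independently with either method above;
# in the truly out-of-memory case you would stream each half to its own file instead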

Wai Yip Tung
A: 

You say you have a list of about 5 million strings, and the list may contain duplicates. You don't say (1) what you want to do with the duplicates (log them, delete all but one occurrence, ...), (2) what you want to do with the non-duplicates, (3) whether this list is a stand-alone structure or whether the strings are keys to some other data that you haven't mentioned, or (4) why you haven't deleted duplicates at input time instead of building a list that contains them.

As a Data Structures and Algorithms 101 exercise, the answer you have accepted is nonsense. If you have enough memory, detecting duplicates using a set should be faster than sorting a list and scanning it. Note that deleting M items from a list of size N is O(MN). The code for each of the various alternatives is short and rather obvious; why don't you try writing them, timing them, and reporting back?
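For example, a rough timing harness along those lines (find_dups_sorted and find_dups_set are hypothetical names for the sort-and-scan and set-based implementations sketched in the answers above):

import time

def time_it(func, data, label):
    # crude wall-clock timing; func takes the list and returns the duplicates
    start = time.time()
    result = func(data)
    print '%s: %.2fs, %d duplicates' % (label, time.time() - start, len(result))

time_it(find_dups_sorted, list(hashes), 'sort and scan')   # pass a copy, since sorting mutates
time_it(find_dups_set, hashes, 'set')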

If this is a real-world problem that you have, you need to provide much more information if you want a sensible answer.

John Machin