I have a large number of big, long strings that I want to do existence lookups for. I never need the whole string to be saved. As far as I can tell, set() actually stores the strings, which is eating up a lot of my memory.

Does such a data structure exist?

done = hash_only_set()  # hypothetical structure that keeps only hashes
while len(queue) > 0:
    item = queue.pop()
    if item not in done:
        process(item)
        done.add(item)

(My queue is constantly being filled by other threads, so I have no way of de-duplicating it at the start.)

+7  A: 

It's certainly possible to keep a set of only hashes:

done = set()
while len(queue) > 0:
    item = queue.pop()
    h = hash(item)
    if h not in done:
        process(item)
        done.add(h)

Notice that because of hash collisions, there is a chance that you consider an item done even though it isn't.

If you cannot accept this risk, you really need to save the full strings to be able to tell whether you have seen a given string before. Alternatively: perhaps the processing itself would be able to tell?

Yet another alternative: if you cannot afford to keep the strings in memory, keep them in a database, or create files in a directory named after the strings.
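
For the database route, a minimal sketch using the standard-library sqlite3 module (the file and table names here are illustrative, not prescribed):

import sqlite3

db = sqlite3.connect("done.db")  # illustrative filename
db.execute("CREATE TABLE IF NOT EXISTS done (item TEXT PRIMARY KEY)")

def is_done(item):
    return db.execute("SELECT 1 FROM done WHERE item = ?",
                      (item,)).fetchone() is not None

def mark_done(item):
    # INSERT OR IGNORE makes repeated marks harmless
    db.execute("INSERT OR IGNORE INTO done (item) VALUES (?)", (item,))
    db.commit()

This trades memory for disk and keeps the lookups exact, since the full strings are stored.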

Martin v. Löwis
would your method store the hash twice? once as the key in the table, and once as the value?
Paul Tarjan
Using a hash... Very clever :-)
e-satis
Using the builtin hash, you have a good chance of finding collisions (it's only 32 bits).
tonfa
e-satis: set() always uses the hash (but not only the hash); set(), just like dict(), is a hash map.
kaizer.se
@Paul: since it's a set, no it wouldn't store it twice.
tonfa
@Paul: On a 32-bit system, each item will take 20 bytes: 8 bytes in the set, and 12 bytes for the integer object itself. This will indeed store the value twice (once in the setentry, once in the int object); the remaining 12 bytes are 3 pointers.
Martin v. Löwis
@tonfa: more precisely, because of the birthday paradox, the chance of a collision is 50% when you have ca. 77,000 strings.
Martin v. Löwis
@martin, my favorite rule of thumb is that you worry about collision ("high/significant risk thereof") around the sqrt of the number of different hash values (about 64K for otherwise-perfect 32-bit hashes). Yes you can compute it precisely, but this back-of-the-envelope approach lets you eyeball how large a hash-space you need easily (more than the square of the number of items you expect;-).
Alex Martelli
@alex: sure, using the square root for the birthday paradox is a widespread rule of thumb. However, it is also often misunderstood, so I decided to compute it more precisely (IIUC, the actual value for the 50% boundary would be 77163, for 2**32 evenly distributed hash values).
Martin v. Löwis
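
For reference, that 50% boundary follows from the birthday bound n ≈ sqrt(2 · ln 2 · 2**32); a quick check:

import math

# 50% birthday-collision bound for 2**32 evenly distributed hash values
print(math.sqrt(2 * math.log(2) * 2 ** 32))  # ~77163.6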
+4  A: 

You can use a data structure called a Bloom filter, designed specifically for this purpose. A Python implementation can be found here.

EDIT: Important notes:

  1. False positives are possible in this data structure, i.e. a check for the existence of a string could return a positive result even though it was not stored.
  2. False negatives (getting a negative result for a string that was stored) are not possible.

That said, the chance of a false positive can be made very small if the filter is sized properly, so I consider this data structure to be very useful.
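
For illustration, here is a minimal bit-array sketch of the idea (the bit count, hash count, and double-hashing scheme are assumptions for the sketch, not taken from the linked implementation):

import hashlib
import struct

class SimpleBloom(object):
    """Minimal Bloom filter: false positives possible, false negatives not."""
    def __init__(self, num_bits=8 * 1024 * 1024, num_hashes=7):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)
    def _positions(self, s):
        # Derive num_hashes bit positions from one SHA-256 digest
        # via double hashing: index_i = h1 + i * h2 (mod num_bits).
        digest = hashlib.sha256(s.encode("utf-8")).digest()
        h1, h2 = struct.unpack(">QQ", digest[:16])
        h2 |= 1  # make the step odd so the probe sequence varies
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]
    def add(self, s):
        for pos in self._positions(s):
            self.bits[pos // 8] |= 1 << (pos % 8)
    def __contains__(self, s):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(s))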

spatz
With a bloom filter, false positives are possible, which I think rules it out for what the OP wants. I'm sure he wouldn't want one of his items to go unprocessed because of a false positive.
Virgil Dupras
All solutions which do not store the entire strings will have a chance of false positives, but this one has low memory usage and can be adjusted to your needs.
spatz
Thanks for the literature.
kaizer.se
+1  A: 

You would have to think about how to do the lookup, since a set relies on two methods: __hash__ and __eq__.

The hash is a "loose part" that you can store on its own, but __eq__ is not: you need both full strings to do the comparison.

If you only need negative confirmation (this item is not part of the set), you could fill a set implementation of your own with your strings, then "finalize" the set by removing all strings except those involved in hash collisions (those are kept around for __eq__ tests), and promise not to add more objects to your set. Now you have an exclusive test available: you can tell for certain that an object is not in your set, but when "obj in Set == True" you cannot be certain whether it is a false positive or not.
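
A sketch of that "finalized set" idea (class and attribute names are illustrative):

class FinalizedSet(object):
    """Keep only hashes, except for strings whose hashes collide with
    another stored string; those full strings are kept for __eq__ tests."""
    def __init__(self, strings):
        first = {}          # hash -> first string seen with that hash
        self.collided = {}  # hash -> set of all stored strings sharing it
        for s in strings:
            h = hash(s)
            if h in first and first[h] != s:
                self.collided.setdefault(h, set([first[h]])).add(s)
            else:
                first.setdefault(h, s)
        self.hashes = set(first)  # full strings in `first` are now dropped
    def __contains__(self, s):
        h = hash(s)
        if h not in self.hashes:
            return False                  # certain: never stored
        if h in self.collided:
            return s in self.collided[h]  # exact among known collisions
        return True                       # may be a false positive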

Edit: This is basically the Bloom filter cleverly linked in another answer, except that a Bloom filter may use more than one hash per element, which is really clever.

Edit2: This is my 3-minute bloom filter:

class BloomFilter(object):
    """
    Let's make a bloom filter
    http://en.wikipedia.org/wiki/Bloom_filter

    __contains__ has false positives, but never false negatives
    """
    def __init__(self, hashes=(hash, )):
        self.hashes = hashes
        self.data = set()
    def __contains__(self, obj):
        # An object counts as present only if every one of its
        # hash values has been recorded.
        return all((h(obj) in self.data) for h in self.hashes)
    def add(self, obj):
        self.data.update(h(obj) for h in self.hashes)
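
A quick usage sketch, assuming the class above:

bf = BloomFilter()
bf.add("some long string")
print("some long string" in bf)  # True
print("another string" in bf)    # False, barring a hash collision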
kaizer.se
+2  A: 

You need to know the whole string to have 100% certainty. If you have lots of strings with similar prefixes, you could save space by using a trie to store the strings. If your strings are long, you could also save space by using a hash function with a large output, like SHA-1, to make the possibility of hash collisions so remote as to be irrelevant.
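
For illustration, a minimal dict-of-dicts trie sketch (a real implementation would also want deletion, iteration, and so on):

class Trie(object):
    def __init__(self):
        self.root = {}
    def add(self, s):
        node = self.root
        for ch in s:
            node = node.setdefault(ch, {})  # shared prefixes share nodes
        node["$"] = True  # end-of-string marker
    def __contains__(self, s):
        node = self.root
        for ch in s:
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node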

If you can make the process() function idempotent (i.e. having it called twice on an item is only a performance issue), then the problem becomes a lot simpler and you can use lossy data structures, such as Bloom filters.

Ants Aasma
This is a very, very good suggestion. You could save all the string memory at only a tiny (a tenth of a percent or less?) extra CPU cost.
kaizer.se
+3  A: 

If you use a secure hash function (like SHA-256, found in the hashlib module) to hash the strings, it's very unlikely that you would find duplicates (and if you find some, you can probably win a prize, as with most cryptographic hash functions).

The builtin __hash__() method does not guarantee you won't have duplicates (and since it only uses 32 bits, it's very likely you'll find some).
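
For example, the loop from the question could store 32-byte digests instead of the strings themselves (a sketch, assuming the items are text and process() is the OP's function):

import hashlib

done = set()
while len(queue) > 0:
    item = queue.pop()
    digest = hashlib.sha256(item.encode("utf-8")).digest()  # 32 bytes per item
    if digest not in done:
        process(item)
        done.add(digest)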

tonfa
If Python's string hash holds up, it might be reasonable to use the string hash with <65000 strings: http://stackoverflow.com/questions/1303021/shortest-hash-in-python-to-name-cache-files/1303619#1303619
kaizer.se
Using a secure hash is not necessary. Secure != low probability of collision. Secure just means it is infeasible to produce a certain hash from "wrong" data.
truppo
@truppo If you look at http://en.wikipedia.org/wiki/Cryptographic_hash_function you'll see that the low probability of collision is part of the properties of an ideal cryptographic hash.
tonfa
@kaizer.se: He said he had *a lot* of strings :)
tonfa
64K strings of, say, 100 characters each => at least 6400 kilobytes; not that prohibitive, and it would probably take much more than that to become a memory problem, so you are right.
kaizer.se
A: 

As has been hinted already, if the answers offered here (most of which break down in the face of hash collisions) are not acceptable, you would need to use a lossless representation of the strings.

Python's zlib module provides built-in string compression and could be used to pre-process the strings before you put them in your set. Note, however, that the strings would need to be quite long (which you hint that they are) and have low entropy in order to save much memory. Other compression options might provide better space savings, and some Python-based implementations can be found here.
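
A sketch of that pre-processing (zlib.compress is deterministic for identical input at a fixed level, so the compressed bytes can serve directly as set members and remain fully decompressible):

import zlib

done = set()
while len(queue) > 0:
    item = queue.pop()
    key = zlib.compress(item.encode("utf-8"), 9)  # lossless, reversible
    if key not in done:
        process(item)
        done.add(key)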

Evan Grim