ansaurus

Question

Python, Huge Iteration Performance Problem

Answer 1

+10 A:

def slices(seq, length, prefer_last=False):
  unique = {}
  if prefer_last: # this doesn't have to be a parameter, just choose one
    for start in xrange(len(seq) - length + 1):
      unique[seq[start:start+length]] = start
  else: # prefer first
    for start in xrange(len(seq) - length, -1, -1):
      unique[seq[start:start+length]] = start
  return unique

# or find all locations for each slice:
import collections
def slices(seq, length):
  unique = collections.defaultdict(list)
  for start in xrange(len(seq) - length + 1):
    unique[seq[start:start+length]].append(start)
  return unique

This function (currently in my iter_util module) is O(n) (n being the length of each word) and you would use set(slices(..)) (with set operations such as difference) to get slices unique across all words (example below). You could also write the function to return a set, if you don't want to track locations. Memory usage will be high (though still O(n), just a large factor), possibly mitigated (though not by much if length is only 20) with a special "lazy slice" class that stores the base sequence (the string) plus start and stop (or start and length).

Printing unique slices:

a = set(slices("aab", 2)) # {"aa", "ab"}
b = set(slices("abb", 2)) # {"ab", "bb"}
c = set(slices("abc", 2)) # {"ab", "bc"}
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (x for x in all if x is not a), a)
print a_unique # {"aa"}

Including locations:

a = slices("aab", 2)
b = slices("abb", 2)
c = slices("abc", 2)
all = [a, b, c]
import operator
a_unique = reduce(operator.sub, (set(x) for x in all if x is not a), set(a))
# a_unique is only the keys so far
a_unique = dict((k, a[k]) for k in a_unique)
# now it's a dict of slice -> location(s)
print a_unique # {"aa": 0} or {"aa": [0]}
               # (depending on which slices function used)

In a test script closer to your conditions, using randomly generated words of 5m characters and a slice length of 20, memory usage was so high that my test script quickly hit my 1G main memory limit and started thrashing virtual memory. At that point Python spent very little time on the CPU and I killed it. Reducing either the slice length or word length (since I used completely random words that reduces duplicates and increases memory use) to fit within main memory and it ran under a minute. This situation plus O(n**2) in your original code will take forever, and is why algorithmic time and space complexity are both important.

import operator
import random
import string

def slices(seq, length):
  unique = {}
  for start in xrange(len(seq) - length, -1, -1):
    unique[seq[start:start+length]] = start
  return unique

def sample_with_repeat(population, length, choice=random.choice):
  return "".join(choice(population) for _ in xrange(length))

word_length = 5*1000*1000
words = [sample_with_repeat(string.lowercase, word_length) for _ in xrange(3)]
slice_length = 20
words_slices_sets = [set(slices(x, slice_length)) for x in words]
unique_words_slices = [reduce(operator.sub,
                              (x for x in words_slices_sets if x is not n),
                              n)
                       for n in words_slices_sets]
print [len(x) for x in unique_words_slices]

Roger Pate 2009-12-21 18:42:06

Answer 2

+1 A:

You say you have a "word" 5 million characters long, but I find it hard to believe this is a word in the usual sense.

If you can provide more information about your input data, a specific solution might be available.

For example, English text (or any other written language) might be sufficiently repetitive that a trie would be useable. In the worst case however, it would run out of memory constructing all 256^20 keys. Knowing your inputs makes all the difference.

edit

I took a look at some genome data to see how this idea stacked up, using a hardcoded [acgt]->[0123] mapping and 4 children per trie node.

adenovirus 2: 35,937bp -> 35,899 distinct 20-base sequences using 469,339 trie nodes
enterobacteria phage lambda: 48,502bp -> 40,921 distinct 20-base sequences using 529,384 trie nodes.

I didn't get any collisions, either within or between the two data sets, although maybe there is more redundancy and/or overlap in your data. You'd have to try it to see.

If you do get a useful number of collisions, you could try walking the three inputs together, building a single trie, recording the origin of each leaf and pruning collisions from the trie as you go.

If you can't find some way to prune the keys, you could try using a more compact representation. For example you only need 2 bits to store [acgt]/[0123], which might save you space at the cost of slightly more complex code.

I don't think you can just brute force this though - you need to find some way to reduce the scale of the problem, and that depends on your domain knowledge.

Useless 2009-12-22 14:56:16

The question is tagged "bioinformatics", so most probably those aren't English words, but DNA sequences.

Roberto Bonvallet 2009-12-22 15:18:07

So it is. If that implies only 4 characters, it might still work ... 4^20 ~= 10^12, so this is still only workable if there are lots of common subtrees to collapse. I don't know enough about DNA to guess that.

Useless 2009-12-22 15:58:21

Answer 3

A:

Let me build off Roger Pate's answer. If memory is an issue, I'd suggest instead of using the strings as the keys to the dictionary, you could use a hashed value of the string. This would save the cost of the storing the extra copy of the strings as the keys (at worst, 20 times the storage of an individual "word").

import collections
def hashed_slices(seq, length, hasher=None):
  unique = collections.defaultdict(list)
  for start in xrange(len(seq) - length + 1):
    unique[hasher(seq[start:start+length])].append(start)
  return unique

(If you really want to get fancy, you can use a rolling hash, though you'll need to change the function.)

Now, we can combine all the hashes :

unique = []  # Unique words in first string

# create a dictionary of hash values -> word index -> start position
hashed_starts = [hashed_slices(word, 20, hashing_fcn) for word in words]
all_hashed = collections.defaultdict(dict)
for i, hashed in enumerate(hashed_starts) :
   for h, starts in hashed.iteritems() :
     # We only care about the first word
     if h in hashed_starts[0] :
       all_hashed[h][i]=starts

# Now check all hashes
for starts_by_word in all_hashed.itervalues() :
  if len(starts_by_word) == 1 :
    # if there's only one word for the hash, it's obviously valid
    unique.extend(words[0][i:i+20] for i in starts_by_word.values())
  else :
    # we might have a hash collision
    candidates = {}
    for word_idx, starts in starts_by_word.iteritems() :
      candidates[word_idx] = set(words[word_idx][j:j+20] for j in starts)
    # Now go that we have the candidate slices, find the unique ones
    valid = candidates[0]
    for word_idx, candidate_set in candidates.iteritems() :
      if word_idx != 0 :
        valid -= candidate_set
    unique.extend(valid)

(I tried extending it to do all three. It's possible, but the complications would detract from the algorithm.)

Be warned, I haven't tested this. Also, there's probably a lot you can do to simplify the code, but the algorithm makes sense. The hard part is choosing the hash. Too many collisions and you'll won't gain anything. Too few and you'll hit the memory problems. If you are dealing with just DNA base codes, you can hash the 20-character string to a 40-bit number, and still have no collisions. So the slices will take up nearly a fourth of the memory. That would save roughly 250 MB of memory in Roger Pate's answer.

The code is still O(N^2), but the constant should be much lower.

AFoglia 2009-12-22 21:53:46

Answer 4

A:

Let's attempt to improve on Roger Pate's excellent answer.

Firstly, let's keep sets instead of dictionaries - they manage uniqueness anyway.

Secondly, since we are likely to run out of memory faster than we run out of CPU time (and patience), we can sacrifice CPU efficiency for the sake of memory efficiency. So perhaps try only the 20s starting with one particular letter. For DNA, this cuts the requirements down by 75%.

seqlen = 20
maxlength = max([len(word) for word in words])
for startletter in letters:
    for letterid in range(maxlength):
        for wordid,word in words:
            if (letterid < len(word)):
                letter = word[letterid]
                if letter is startletter:
                    seq = word[letterid:letterid+seqlen]
                    if seq in seqtrie and not wordid in seqtrie[seq]:
                        seqtrie[seq].append(wordid)

Or, if that's still too much memory, we can go through for each possible starting pair (16 passes instead of 4 for DNA), or every 3 (64 passes) etc.

Phil H 2009-12-28 22:39:47

ansaurus

tags:

views:

answers:

Python, Huge Iteration Performance Problem

related questions