views: 149
answers: 5
I'm working on a statistical project that involves iterating over every possible way to partition a collection of strings and running a simple calculation on each. Specifically, each possible substring has a probability associated with it, and I'm trying to get the sum across all partitions of the product of the substring probability in the partition.

For example, if the string is 'abc', then there would be probabilities for 'a', 'b', 'c', 'ab', 'bc' and 'abc'. There are four possible partitionings of the string: 'abc', 'ab|c', 'a|bc' and 'a|b|c'. The algorithm needs to find the product of the component probabilities for each partitioning, then sum the four resultant numbers.

Currently, I've written a Python iterator that uses binary representations of integers to encode the partitions (e.g. 00, 01, 10, 11 for the example above) and simply runs through the integers. Unfortunately, this is immensely slow for strings longer than 20 or so characters.

Can anybody think of a clever way to perform this operation without simply running through every partition one at a time? I've been stuck on this for days now.

In response to some comments here is some more information:
The string can be just about anything, e.g. "foobar(foo2)" -- our alphabet is lowercase alphanumeric plus all three types of braces ("(","[","{"), hyphens and spaces.
The goal is to get the likelihood of the string given individual 'word' likelihoods. So L(S='abc')=P('abc') + P('ab')P('c') + P('a')P('bc') + P('a')P('b')P('c') (Here "P('abc')" indicates the probability of the 'word' 'abc', while "L(S='abc')" is the statistical likelihood of observing the string 'abc').
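For concreteness, the brute-force iteration works roughly like this (a simplified sketch, not the exact code; probs is a dict mapping substrings to probabilities, with missing substrings treated as probability 0):

def likelihoodBruteForce(text, probs):
    # Each bit of `mask` says whether there is a word break after the
    # corresponding character, so there are 2**(n-1) partitions to visit.
    n = len(text)
    total = 0.0
    for mask in range(2 ** (n - 1)):
        product, start = 1.0, 0
        for pos in range(n - 1):
            if mask & (1 << pos):  # break after character `pos`
                product *= probs.get(text[start:pos + 1], 0.0)
                start = pos + 1
        product *= probs.get(text[start:], 0.0)  # last word runs to the end
        total += product
    return total

Each mask corresponds to one partition, which is why the running time explodes as the string gets longer.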

+3  A: 

First, profile to find the bottleneck.

If the bottleneck is simply the massive number of possible partitions, I recommend parallelization, possibly via multiprocessing. If that's still not enough, you might look into a Beowulf cluster.

If the bottleneck is just that the calculation is slow, try shelling out to C. It's pretty easy to do via ctypes.

Also, I'm not really sure how you're storing the partitions, but you could probably squash memory consumption a pretty good bit by using one string and a suffix array. If your bottleneck is swapping and/or cache misses, that might be a big win.
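For the multiprocessing route, a rough sketch (hypothetical helper names; it simply splits the 2**(N-1) partition indices into chunks and sums them in worker processes) might look like:

import multiprocessing

def partialSum(args):
    # Worker: sum the partition products for mask values in [lo, hi).
    # Each bit of `mask` marks a word break after the corresponding character.
    text, probs, lo, hi = args
    n, total = len(text), 0.0
    for mask in range(lo, hi):
        product, start = 1.0, 0
        for pos in range(n - 1):
            if mask & (1 << pos):
                product *= probs.get(text[start:pos + 1], 0.0)
                start = pos + 1
        total += product * probs.get(text[start:], 0.0)
    return total

def parallelLikelihood(text, probs, workers=4):
    # Split the 2**(N-1) partition indices into one chunk per worker.
    # (Call this from under "if __name__ == '__main__':" so it also works
    # on platforms that spawn rather than fork.)
    nMasks = 2 ** (len(text) - 1)
    step = -(-nMasks // workers)  # ceiling division
    chunks = [(text, probs, lo, min(lo + step, nMasks))
              for lo in range(0, nMasks, step)]
    pool = multiprocessing.Pool(workers)
    try:
        return sum(pool.map(partialSum, chunks))
    finally:
        pool.close()

This only divides the exponential amount of work among the workers, of course -- it doesn't change the growth rate.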

Hank Gay
I've run the Python profiler, and most of the time is spent just iterating. I do suspect that parallelization is the only answer, but I was hoping there's some way to make the complexity grow less than exponentially with string length. (Fortunately, I do have access to some pretty impressive clusters for this...)
Peter McMahan
Unless there's some relationship between the probabilities for different input partitions, I don't see a way to avoid all that work. If there is a relation, you may be able to leverage that relation to avoid iterating across every partition.
Hank Gay
And thanks for the suffix array reference. This will help a lot with several parts of the larger problem.
Peter McMahan
+5  A: 

A Dynamic Programming solution (if I understood the question right):

def dynProgSolution(text, probs):
  # probUpTo[i] is the likelihood of text[:i], i.e. the sum over all
  # partitions of text[:i] of the product of the word probabilities.
  probUpTo = [1]
  for i in range(1, len(text)+1):
    # Every partition of text[:i] ends with some last word text[k:i]; its
    # contribution is (likelihood of text[:k]) * probs[text[k:i]].
    # (probs is assumed to contain every substring of text.)
    cur = sum(v*probs[text[k:i]] for k, v in enumerate(probUpTo))
    probUpTo.append(cur)
  return probUpTo[-1]

print(dynProgSolution(
  'abc',
  {'a': 0.1, 'b': 0.2, 'c': 0.3,
   'ab': 0.4, 'bc': 0.5, 'abc': 0.6}
  ))

The complexity is O(N²) so it will easily solve the problem for N=20.

Why does this work:

  • Everything you would multiply by probs['a']*probs['b'] you will also multiply by probs['ab'].
  • Thanks to the distributive property of multiplication over addition, you can sum those two together and multiply that single sum by all of its continuations.
  • For every possible last substring text[k:i], the loop adds the probability of that substring multiplied by the sum of the probabilities of all partitions of the text before it, so every partition ending with that substring is counted exactly once (see the worked example below).
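For example, with the 'abc' probabilities above the intermediate values are probUpTo = [1, 0.1, 0.42, 0.776]: 0.42 = P('ab') + P('a')P('b') = 0.4 + 0.02, and 0.776 = P('abc') + P('a')P('bc') + 0.42*P('c') = 0.6 + 0.05 + 0.126 -- the same 0.776 you get by summing the four partition products directly (0.6 + 0.12 + 0.05 + 0.006).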
yairchu
Looks interesting. It'll take me a bit to figure how/what it's doing. Thanks!
Peter McMahan
@Peter McMahan: I added a bit of explanation too. I hope it helps
yairchu
Very nice, thank you.
Peter McMahan
Another question (I don't have much experience with this type of algorithm): is there any nice way to parallelize it into multiple processes (maybe something like what Alex Martelli suggested below)?
Peter McMahan
@Peter: I'm afraid there's no nice way to parallelize it as far as I can tell. Luckily it probably won't be needed imo.
yairchu
+1  A: 

Your substrings are going to be reused over and over again by the longer strings, so caching the values using a memoizing technique seems like an obvious thing to try. This is just a time-space trade off. The simplest implementation is to use a dictionary to cache values as you calculate them. Do a dictionary lookup for every string calculation; if it's not in the dictionary, calculate and add it. Subsequent calls will make use of the pre-computed value. If the dictionary lookup is faster than the calculation, you're in luck.

I realise you are using Python, but... as a side note that may be of interest, if you do this in Perl, you don't even have to write any code; the built-in Memoize module will do the caching for you!
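In Python, a minimal memoized version might look something like this (a sketch only, essentially the top-down form of the dynamic-programming answer above; probs is assumed to be a dict of substring probabilities as in the other answers):

def likelihood(text, probs, cache=None):
    # L(text) = sum over every possible first word w of P(w) * L(rest).
    # The cache stores likelihoods of already-seen suffixes so each one is
    # computed only once.
    if cache is None:
        cache = {}
    if text == '':
        return 1.0  # an empty remainder contributes a factor of 1
    if text not in cache:
        cache[text] = sum(
            probs.get(text[:i], 0.0) * likelihood(text[i:], probs, cache)
            for i in range(1, len(text) + 1))
    return cache[text]

print(likelihood('abc', {'a': 0.1, 'b': 0.2, 'c': 0.3,
                         'ab': 0.4, 'bc': 0.5, 'abc': 0.6}))  # 0.776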

ire_and_curses
I've been tinkering back and forth with different levels of caching to speed things up. The dataset is pretty big and partitions increase exponentially with string length, so it quickly gets too big for physical ram. I've been playing with SQLite and Tokyo Cabinet to try to do on-disk caching, which I think will be a good approach.
Peter McMahan
+1  A: 

You may get a minor reduction of the amount of computation by a small refactoring based on the distributive and associative properties of arithmetic (and of string concatenation), though I'm not sure it will be a life-changer. The core idea would be as follows:

consider a longish string, e.g. 'abcdefghik', 10-long, for definiteness w/o loss of generality. In a naive approach you'd be multiplying p(a) by the many partitions of the 9-tail, p(ab) by the many partitions of the 8-tail, etc; in particular p(a)*p(b) will be multiplying exactly the same partitions of the 8-tail (all of them) as p(ab) will -- 3 multiplications and 2 sums among them. So factor that out:

(p(ab) + p(a) * p(b)) * (partitions of the 8-tail)

and we're down to 2 multiplications and 1 sum for this part, having saved 1 product and 1 sum, to cover all partitions with a split point just right of 'b'. When it comes to partitions with a split just right of 'c',

(p(abc) + p(ab) * p(c) + p(a) * (p(b)*p(c) + p(bc))) * (partitions of the 7-tail)

the savings mount, partly thanks to the internal refactoring -- though of course one must be careful about double-counting.

I'm thinking that this approach may be generalized -- start with the midpoint and consider all partitions that have a split there, separately (and recursively) for the left and right part, multiplying and summing; then add all partitions that DON'T have a split there. E.g. in the example, the halves being 'abcde' on the left and 'fghik' on the right, the second part covers all partitions where 'e' and 'f' are together rather than apart -- so "collapse" that 'ef' into a new 'superletter' X, and you're left with a string one shorter, 'abcdXghik' (of course the probabilities for the substrings of THAT map directly to the originals, e.g. the p(cdXg) in the new string is just exactly the p(cdefg) in the original).
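A rough sketch of that midpoint split-or-glue recursion (illustrative only; pieces is a list of substrings that must each stay inside a single word, initially one piece per character, and probs is the substring-probability dict used elsewhere on this page):

def splitOrGlue(pieces, probs):
    # Base case: a single (possibly glued) piece can only be one whole word.
    if len(pieces) == 1:
        return probs.get(pieces[0], 0.0)
    m = len(pieces) // 2
    # Partitions WITH a word break at the midpoint: the left and right halves
    # are independent, so their likelihoods simply multiply.
    split = splitOrGlue(pieces[:m], probs) * splitOrGlue(pieces[m:], probs)
    # Partitions WITHOUT a break there: glue the two middle pieces into one
    # 'superletter' and recurse on the piece list that is one shorter.
    glued = pieces[:m - 1] + [pieces[m - 1] + pieces[m]] + pieces[m + 1:]
    return split + splitOrGlue(glued, probs)

print(splitOrGlue(list('abc'), {'a': 0.1, 'b': 0.2, 'c': 0.3,
                                'ab': 0.4, 'bc': 0.5, 'abc': 0.6}))  # 0.776

The two recursive calls on pieces[:m] and pieces[m:] are exactly the independent subproblems mentioned for parallelization in the comments below; the glued call handles the string that is effectively one letter shorter.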

Alex Martelli
This will definitely help. I was trying to figure out a good way to use these kinds of partition subspaces. The only problem I see is that this will take a lot of memory for longer strings, which might complicate any parallelization efforts (we're trying to avoid the world of distributed databases, though that may not be possible).
Peter McMahan
When parallelizing computation on (say) a 20-char string, this breaks the problem down into two 10-char strings (break in the middle), ending in two numbers to be multiplied, and one 19-char string (merging the 10th and 11th letters of the original); you can repeat the breakdown a few more times until you have the right granularity/number of subtasks to dispatch around to several processors or nodes.
Alex Martelli
A: 

You should look into the itertools module. It can create a generator for you that is very fast. Given your input string, it will provide you with all possible permutations. Depending on what you need, there is also a combinations() generator. I'm not quite sure if you're looking at "b|ca" when you're looking at "abc," but either way, this module may prove useful to you.

rledley
Looks like the partitions-into-substrings the OP's looking at are order preserving -- essentially for an N-char string there either is or isn't a break at each of the N-1 "spots in-between", so 2**(N-1) possible partitions. I love itertools, but it really doesn't have much to contribute here!-)
Alex Martelli