ansaurus

Question

Are there any cleverly efficient algorithms to perform a calculation over the space of partitionings of a string?

Answer 1

+3 A:

First, profile to find the bottleneck.

If the bottleneck is simply the massive number of possible partitions, I recommend parallelization, possibly via multiprocessing. If that's still not enough, you might look into a Beowulf cluster.

If the bottleneck is just that the calculation is slow, try shelling out to C. It's pretty easy to do via ctypes.

Also, I'm not really sure how you're storing the partitions, but you could probably squash memory consumption a pretty good bit by using one string and a suffix array. If your bottleneck is swapping and/or cache misses, that might be a big win.

Hank Gay 2009-08-03 15:40:49

I've run the python profiler, and most of the time is spent just iterating. I do suspect that parallelization is the only answer, but I'm was hoping there's some way to make the complexity increase something less than exponentially with string length. (Fortunately, I do have access to some pretty impressive clusters for this...)

Peter McMahan 2009-08-03 15:48:26

Unless there's some relationship between the probabilities for different input partitions, I don't see a way to avoid all that work. If there is a relation, you may be able to leverage that relation to avoid iterating across every partition.

Hank Gay 2009-08-03 15:50:50

And thanks for the suffix array reference. This will help a lot with several parts of the larger problem.

Peter McMahan 2009-08-03 15:51:54

Answer 2

+5 A:

A Dynamic Programming solution (if I understood the question right):

def dynProgSolution(text, probs):
  probUpTo = [1]
  for i in range(1, len(text)+1):
    cur = sum(v*probs[text[k:i]] for k, v in enumerate(probUpTo))
    probUpTo.append(cur)
  return probUpTo[-1]

print dynProgSolution(
  'abc',
  {'a': 0.1, 'b': 0.2, 'c': 0.3,
   'ab': 0.4, 'bc': 0.5, 'abc': 0.6}
  )

The complexity is O(N²) so it will easily solve the problem for N=20.

How why does this work:

Everything you will multiply by probs['a']*probs['b'] you will also multiply by probs['ab']
Thanks to the Distributive Property of multiplication and addition, you can sum those two together and multiply this single sum by all of its continuations.
For every possible last substring, it adds the sum of all splits ending with that by adding its probability multiplied by the sum of all probabilities of previous paths. (alternative phrasing would be appreciated. my python is better than my english..)

yairchu 2009-08-03 15:57:35

Looks interesting. It'll take me a bit to figure how/what it's doing. Thanks!

Peter McMahan 2009-08-03 16:27:10

@Peter McMahan: I added a bit of explanation too. I hope it helps

yairchu 2009-08-03 17:22:35

Very nice, thank you.

Peter McMahan 2009-08-03 18:01:03

Another question (not so much experience with this type of algorithm). Is there any nice way to parallelize it into multiple processes (maybe something like what Alex Martelli suggested below)

Peter McMahan 2009-08-03 18:15:46

@Peter: I'm afraid there's no nice way to parallelize it as far as I can tell. Luckily it probably won't be needed imo.

yairchu 2009-08-03 19:41:37

Answer 3

+1 A:

Your substrings are going to be reused over and over again by the longer strings, so caching the values using a memoizing technique seems like an obvious thing to try. This is just a time-space trade off. The simplest implementation is to use a dictionary to cache values as you calculate them. Do a dictionary lookup for every string calculation; if it's not in the dictionary, calculate and add it. Subsequent calls will make use of the pre-computed value. If the dictionary lookup is faster than the calculation, you're in luck.

I realise you are using Python, but... as a side note that may be of interest, if you do this in Perl, you don't even have to write any code; the built in Memoize module will do the caching for you!

ire_and_curses 2009-08-03 15:58:10

I've been tinkering back and forth with different levels of caching to speed things up. The dataset is pretty big and partitions increase exponentially with string length, so it quickly gets too big for physical ram. I've been playing with SQLite and Tokyo Cabinet to try to do on-disk caching, which I think will be a good approach.

Peter McMahan 2009-08-03 16:17:40

Answer 4

+1 A:

You may get a minor reduction of the amount of computation by a small refactoring based on associative properties of arithmetic (and string concatenation) though I'm not sure it will be a life-changer. The core idea would be as follows:

consider a longish string e.g. 'abcdefghik', 10-long, for definiteness w/o loss of generality. In a naive approach you'd be multiplying p(a) by the many partitions of the 9-tail, p(ab) by the many partitions of the 8-tail, etc; in particular p(a) and p(b) will be multiplying exactly the same partitions of the 8-tail (all of them) as p(ab) will -- 3 multiplications and two sums among them. So factor that out:

(p(ab) + p(a) * p(b)) * (partitions of the 8-tail)

and we're down to 2 multiplications and 1 sum for this part, having saved 1 product and 1 sum. to cover all partitions with a split point just right of 'b'. When it comes to partitions with a split just right of 'c',

(p(abc) + p(ab) * p(c) + p(a) * (p(b)*p(c)+p(bc)) * (partitions of the 7-tail)

the savings mount, partly thanks to the internal refactoring -- though of course one must be careful about double-counting. I'm thinking that this approach may be generalized -- start with the midpoint and consider all partitions that have a split there, separately (and recursively) for the left and right part, multiplying and summing; then add all partitions that DON'T have a split there, e.g. in the example, the halves being 'abcde' on the left and 'fghik' on the right, the second part is about all partitions where 'ef' are together rather than apart -- so "collapse" all probabilities by considering that 'ef' as a new 'superletter' X, and you're left with a string one shorter, 'abcdXghik' (of course the probabilities for the substrings of THAT map directly to the originals, e.g. the p(cdXg) in the new string is just exactly the p(cdefg) in the original).

Alex Martelli 2009-08-03 15:58:36

This will definitely help. I was trying to figure out a good way to use these kinds of partition subspaces. The only problem I see is that this will take a lot of memory for longer strings, which might complicate any parallelization efforts (we're trying to avoid the world of distributed databases, though that may not be possible).

Peter McMahan 2009-08-03 16:10:41

When parallelizing computation on (say) a 20-char string, this breaks the problem down into two 10-char strings (break in the middle), ending in two numbers to be multiplied, and one 19-char string (merging the 10th and 11th letters of the original); you can repeat the breakdown a few more times until you have the right granularity/number of subtasks to dispatch around to several processors or nodes.

Alex Martelli 2009-08-03 17:08:33

Answer 5

A:

You should look into the itertools module. It can create a generator for you that is very fast. Given your input string, it will provide you with all possible permutations. Depending on what you need, there is also a combinations() generator. I'm not quite sure if you're looking at "b|ca" when you're looking at "abc," but either way, this module may prove useful to you.

rledley 2009-08-03 16:16:28

Looks like the partitions-into-substrings the OP's looking at are order preserving -- essentially for an N-char string there either is or isn't a break at each of the N-1 "spots in-between", so 2**(N-1) possible partitions. I love itertools, but it really doesn't have much to contribute here!-)

Alex Martelli 2009-08-03 17:10:12

ansaurus

tags:

views:

answers:

Are there any cleverly efficient algorithms to perform a calculation over the space of partitionings of a string?

related questions