views:

114

answers:

4

Hey. I know this is not a 'refactor my code' site but I made this little piece of code which works perfectly fine with moderately sized input but it's problematic with string of size, say, over 2000.

What it does - it takes a string of numbers as a parameter, and it returns the number of ways it can be interpreted as a string of letters, where each letter in the English alphabet is assigned a numeric value according to its lexical position: A -> 1, B-> 2, Z-> 26 etc.

Since some letters are represented as two numbers the suffix tree is not unique so there can be multiple interpretations. For example '111' could be both 'AAA', 'KA' and 'AK'.

This is my code. It's fairly readable and straightforward but it's problematic because:

  1. It has to copy part of the string every time to call it as argument to the recursive part.
  2. It has to store huge strings in the cache so it's very expensive, memory-wise.
  3. ... it's recursive.

Help much appreciated :)

cache = dict()
def alpha_code(numbers):
    """
    Returns the number of ways a string of numbers
    can be interpreted as an alphabetic sequence.
    """
    global cache
    if numbers in cache: return cache[numbers]

    ## check the basic cases
    if numbers.startswith('0'): return 0
    if len(numbers) <= 1: return 1

    ## dynamic programming part

    ## obviously we can treat the first (non-zero)
    ## digit as a single letter and continue -
    ## '342...' -> C + '42...'
    total = alpha_code(numbers[1:])

    ## the first two digits make for a legal letter
    ## iff this condition holds
    ## '2511...' -> Y + '11...'
    ## '3711...' -> illegal
    if numbers[:2] <= '26':
        total += alpha_code(numbers[2:])

    cache[numbers] = total
    return total
+3  A: 

Try using a dynamic programming approach instead:

  1. Create an array (call it 'P') with 1 element per character in the string.
  2. Initialize P[0] = 1 (unless the first character is 0, in which case just return 0 for the result).
  3. Initialize P[1] = 2 if the first two characters can be interpreted as a letter as can the current; otherwise 1 if the current character is non-zero, otherwise return 0 for the result).
  4. Fill in the rest of the array from left to right, via the following rule (pseudocode):

    P[x] = (if current character is '0' then 0, else P[x-1]) + (if previous character + current character can be interpreted as a letter then P[x-2] else 0)

(Note that if P[x] is ever 0 you should return zero, since that means there were two 0's in a row which your rules don't seem to allow.)

The first portion of the sum is to deal with the case where the current character is interpreted as a letter; the second part of the sum is to deal with the case where the 2 most recent characters are interpreted as a letter.

Essentially, P[x] will be equal to the number of ways that the entirety of the string from the start up to position x can be interpreted as letters. Since you can determine this from looking at previous results, you only need to loop through the contents of the string once - an O(N) time instead of a O(2N) which is a huge improvement. Your final result is simply P[len(input)-1] since "everything from the start up to the end" is the same as just "the entire string".

Example run for your very basic input case of '111':

  • P[0] = 1 (Since 1 is non-zero)
  • P[1] = 2 (Since 11 is a valid letter, and 1 is also a valid letter)
  • P[2] = 3 (Since the most recent two characters together are a valid letter, and the current character is nonzero, so P[0]+P[1] = 1+2 = 3)

Since P[2] is our last result, and it's 3, our answer is 3.

If the string were '1111' instead, we'd continue another step:

  • P[3] = 5 (Since the most recent two characters are a valid letter, and current character is non-zero, so P[1]+P[2] = 2+3 = 5)

The answer is indeed 5 - valid interpretations being AAAA, KK, AKA, AAK, KAA. Notice how those 5 potential answers are built up from the potential interpretations of '11' and '111':

'11': AA or K '111': AAA or KA or AK

'111'+A: AAA+A or KA+A or AK+A '11'+K: AA+K or K+K

Amber
Yeah, that's great solution. Thanks.
ooboo
Nice algorithm! Note that you actually don't have to save the intermediate results in an array: keeping the last two `P` results is sufficient.
EOL
Yep EOL. The array approach can make the concept a little easier to visualize though, and then once the basics behind the algorithm are grasped, it's fairly simple to modify to only store what's necessary.
Amber
A: 

A non-recursive algorithm can be written but I don't think it will be faster. I'm no python expert so I'll just give you an algorithm:

Convert the array on numbers to an array of letters using just A thru I and leaving the zeros in place. 
Create two nested loops where you search and replace all the known pairs that represent larger letters. (AA -> K)

The nice thing about this algorithm is you can optimize the search/replace by first searching for and indexing all the As and Bs in the array.

Since you are using Python, regardless of what you do, you should convert the string into a list of numbers. The numbers 0-9 are static objects in Python so that means they are free to allocate. You could also create reusable character objects of A thru Z. The other benefit of a list is the replace operation to remove two elements and insert a single element is problem much faster than copying the string over and over.

jmucchiello
+1  A: 

Recursion elimination is always a fun task. Here, I'd focus on ensuring the cache is correctly populated, then just use it, as follows...:

import collections

def alpha_code(numbers):
    # populate cache with all needed pieces
    cache = dict()
    pending_work = collections.deque([numbers])
    while pending_work:
      work = pending_work.popleft()
      # if cache[work] is known or easy, just go for it
      if work in cache:
        continue
      if work[:1] == '0':
        cache[work] = 0
        continue
      elif len(work) <= 1:
        cache[work] = 1
        continue
      # are there missing pieces? If so queue up the pieces
      # on the left (shorter first), the current work piece
      # on the right, and keep churning
      n1 = work[1:]
      t1 = cache.get(n1)
      if t1 is None:
        pending_work.appendleft(n1)
      if work[:2] <= '26':
        n2 = work[2:]
        t2 = cache.get(n2)
        if t2 is None:
          pending_work.appendleft(n2)
      else:
        t2 = 0
      if t1 is None or t2 is None:
        pending_work.append(work)
        continue
      # we have all pieces needed to add this one
      total = t1 + t2
      cache[work] = total

    # cache fully populated, so we know the answer
    return cache[numbers]
Alex Martelli
Upvoted. I've been trying to do the wrong thing, apparently - but this is how you do wrong things right :P
ooboo
A: 

You can largely reduce the memory footprint by not copying the string, and passing instead the original string and the index of the first character to study:

def alpha_code(numbers, start_from = 0)
    ....

You would then call recursively as:

alpha_code(numbers, start_from + 1)  # or start_from + 2, etc.

This way, your preserve the simplicity of a recursive algorithm and you save a lot of memory.

EOL
Hey EOL. There's a problem with this code when I call the function several times, what do I hash? I hash where I started from? So what if I call it multiple times, the hashtable will go crazy.Unless I could in some way make a hashtable that stores distinct hashtable each hashtable will refer to a specific string.
ooboo