views:

832

answers:

6
+3  Q: 

Project Euler #219

I'm trying to do project Euler number 219 but am failing to get a grasp of it. I'm trying to use Python which according to project Euler should be able to do it within a minute! This leads me to think that they can't possibly want me to compute each individual bitstring since that would be too slow in Python - there has to be a sub O(n) algorithm.

I have looked at a recursive solution which stores the bitstrings possible prefixes so that it can quickly pick a new bitstring and it even considers them in groups. This only works in brute-forcing values up to a bit over 10:

cost(1) = 1
cost(2) = 5
cost(3) = 11
cost(4) = 18
cost(5) = 26
cost(6) = 35
cost(7) = 44
cost(8) = 54
cost(9) = 64
cost(10)= 74
cost(11)= 85
cost(12)= 96

Past this, I am struggling to comprehend how reduce the problem. It is always possible to make a pattern that goes like:

1
01
001
0001
00001
00000

But it isn't optimal for more than 7 bitstrings. Can anyone guide me in what I should be considering?

+7  A: 

Brute-force is not the right way to do it. This is one of those problems where, if you know a certain thing, it's not that hard, but if you've never heard of that thing, it's practically impossible. That thing is Huffman trees.

[Edit] Upon further review, it seems that you can't quite build a Huffman tree on N nodes with certain frequencies, because the cost function for a string is 4*(# of 1's) + (# of 0's). If the cost function were the length of the string (or a multiple thereof), then you could create a Huffman tree.

Any prefix-free code set can be represented as a Huffman-like binary tree, where each node has either 0 or 2 children, and the leaf nodes represent the codes. Given a tree with N nodes, we can construct a tree with N+1 nodes as follows:

  1. Choose the leaf node with the smallest cost, where the cost of a leaf node is 4*(# of 1's on path from root to leaf) + (# of 0's on path)
  2. Add 2 children to that node

Thus, if the code for the node was previously xxxx, then we remove that code from our code set (since it's no longer a leaf), and add the two codes xxxx0 and xxxx1. The total cost of the code set has now increased by

`cost(xxxx0) + cost(xxxx1) - cost(xxxx) = cost(0) + cost(1) + cost(xxxx) = 5 + cost(xxxx)

Hence, mincost(N+1) <= mincost(N) + 5 + cost(cheapest code in best solution for N). My theory is that that inequality should be an equality, but I haven't been able to prove it yet. For all of the values you've listed which you brute-forced, this statement is in fact an equality.

If it is an equality, then to solve the problem what you would do is this:

  1. Start with an empty code set at a total cost of zero
  2. Iterating from 1 to 109, do:
    1. Find the cheapest code in your code set
    2. Split that code into two by appending 0 and 1
    3. Add the cost of that code + 5 to the total cost

If you use a priority queue, you should be able to do this in O(N log N) time. This may or may not be feasable, given the upper limit of 109.

Adam Rosenfield
A: 

How I solved it, was computing Cost(n) up to n=1000, and then just guessing how it proceeds from there. If you look at the differences of consecutive values, and use The encyclopedia of integer sequences (and some imagination), you can guess the rule.

You can compute the small (<=1000) examples with a kind of dynamic programming, using the recurrence Cost(n) = min {Cost(k)+Cost(n-k)+k+4*(n-k) | 0 < k < n}.

mattiast
+1  A: 

Adam: Thanks loads for the link - it looks very promising! The thing I'm not sure about after reading the Wikipedia article is how the coefficient of 4 is taken into account. I'm struggling to "map" the Project Euler problem to the algorithm. The alphabet would have to be 10^9 items long but what would the weight be?

Another thing that is bothering me is that Huffman coding is at best O(n) surely being too slow as I mentioned above...

mattiast: I don't think that your recurrence works (or I misunderstand it!). My interpretation of it is:

def cost(n):
    if n == 1: return 1

    m = None
    for k in range(1, n):
     v = cost(k)+cost(n-k)+k+4*(n-k)
     if not m or v < m: m = v

    return m

print(cost(6))

The value it returns is 41 when it should be 35. All the same, if my values are correct, then I can't find the differences in ATT's Integer Sequences Encyclopedia.

Your base case is wrong - for n=1, the total cost is 0 by using the 'empty' bitstring.
Adam Rosenfield
That's right, and I told you have to use your imagination! Just look at the differences and you should notice a certain shape of the sequence. I don't want to spoil all the fun for you :)
mattiast
A: 

Adam Rosenfield's solution looks very likely to work. It's late here (around midnight!) so I'll be leaving it 'till the morning. I have an efficient implementation of a priority queue in C so tomorrow I will attempt to use it and find the solution.

I'll report back on the algorithms success but the reasoning seems sound to me and it agrees closely to the data (as said above). However, as I keep mumbling, there must be a sub O(n) algorithm! ;-)

A: 

It turns out that O[n*log(n)] is not too slow but the memory complexity that is roughly O(n) is. However the algorithm proposed above can be reduced further to O(n) time complexity and a low memory complexity too. To do this one can use an array, x, for which x[a] = the number of number values of cost a.

The assumptions made give a correct result for 10^9 so I think they are correct.

A: 

N = 10**9

t = [0]

for c in xrange(N) : m = min(t) t.remove(m) t.append(m+1) t.append(m+4) print sum(t), t