views:

273

answers:

4

The classical RLE algorithm compresses data by using numbers to represent how many times the character following a number appears in the text at that position. For example:

AAABBAAABBCECE => 3A2B3A2B1C1E1C1E

However, in the above example, that method results in even more space being used by the compressed text. A better idea would be to use numbers to represent how many times the substring following a number appears in the given text. For example:

AAABBAAABBCECE => 2AAABB2CE ("AAABB" twice, then "CE" twice).

Now, my question is: how could I implement an efficient algorithm that finds out the minimum number of characters in an optimal RLE using this method? Brute force methods exist, but I need something faster (at most O(length2)). Perhaps we can use dynamic programming?

A: 

I do not believe dynamic programming will work here, as you could have sub-strings about half the length of the full string in the solution. Looks like you need to use brute force. For a related problem, check out the Lempel-Ziv-Welch Algorithm. It is an efficient algorithm that finds a minimal encoding by using substrings.

Yuval F
Are you sure? Worst case, the length will be original_length + 1 ("1" + given string). I don't see why this means dynamic programming won't work. I think the problem still exhibits optimal substructure: if we can find the answer for a part of the string, we can use that in the answer for a bigger part. I'm just not sure how to do it.
IVlad
Form a graph that is actually a straight line of links by replacing each character by a link of length one. Now, at every point where a run length encoding is possible, create an extra link that bypasses the characters the run length encoding expands to, and whose length is the number of characters needed to encode it. The shortest path through this graph is the shortest encoding - but I don't see a fast way to find which substrings can be replaced by run length encoding - and neither does anybody else so far.
mcdowella
It's definitely possible to be efficient - and by that I mean O(L^2) or O(L^2 log L) or definitely something under O(L^3), as I've first seen this problem in a programming competition where the length L was about 5000 and the time limit a few hundred milliseconds. I think using a graph is overkill and would end up being even slower than a brute force solution that for each character tests how many repeating substrings of a certain length (try all lengths) exist starting with that character, mainly because doing this would still be required to build the graph...
IVlad
A: 

Very clever ways of finding matching substrings may lead to considering suffix trees and suffix arrays. Thinking about suffix arrays and compression may lead you to the http://en.wikipedia.org/wiki/Burrows%E2%80%93Wheeler_transform. That may be the most elegant way of souping up run length encoding.

mcdowella
I don't really see how BWT helps me, or even how it's related to RLE at all...Clever ways of matching substrings can wait. How do I even use substring matching in the solution to my posted problem?
IVlad
BWT is a 'partial sort' - the output will have more runs of characters than the input, and so will RLE better.
Nick Johnson
A: 

A very common way to encode RLE compressed data is to designate a special byte as the "DLE" (sorry, I don't remember what that term stands for), which means "the next is a count followed by a byte".

This way, only repeating sequences needs to be encoded. Typically the DLE symbol is chosen to minimize the chance of it occuring naturally in the uncompressed data.

For your original example, let's set the full stop (or dot) as the DLE, this would encode your example as follows:

AAABBAAABBCECE => 3A2B3A2B1C1E1C1E <-- your encoding
AAABBAAABBCECE => .3ABB.3ABBCECE   <-- my encoding

You would only encode a sequence if it actually ends up as saving space. If you limit the length of sequences to 255, so that the count fits in a byte, a sequence thus takes 3 bytes, the DLE, the count, and the byte to repeat. You would probably not encode 3-byte sequences either, because decoding those carries slightly more overhead than a non-encoded sequence.

In your trivial example, the saving is nonexistant, but if you try to compress a bitmap containing a screenshot of a mostly white program, like Notepad, or a browser, then you'll see real space savings.

If you should encounter the DLE character naturally, just emit a count of 0, since we know we would never encode a 0-length sequence, the DLE followed by a 0-byte means that you decode it as a single DLE byte.

Lasse V. Karlsen
But you're still using more characters than the optimum. Your encoding uses 14 characters, while my second encoding uses 9. Encoding consecutively-repeating substrings is usually the best way. I am interested in an algorithm that gets me the minimum number of characters obtainable by doing what I described in my OP.
IVlad
+3  A: 

It can be done in quadratic cubic quadratic time via dynamic programming.

Here is some Python code:

import sys
import numpy as np

bignum = 10000

S = sys.argv[1] #'AAABBAAABBCECE'                                                                                                                              
N = len(S)

# length of longest substring match bet s[i:] and s[j:]                                                                                                        
maxmatch = np.zeros( (N+1,N+1), dtype=int)

for i in xrange(N-1,-1,-1):
  for j in xrange(i+1,N):
    if S[i] == S[j]:
      maxmatch[i,j] = maxmatch[i+1,j+1]+1

# P[n,k] = cost of encoding first n characters given that last k are a block                                                                                   
P = np.zeros( (N+1,N+1),dtype=int ) + bignum
# Q[n] = cost of encoding first n characters                                                                                                                   
Q = np.zeros(N+1, dtype=int) + bignum

# base case: no cost for empty string                                                                                                                          
P[0,0]=0
Q[0]=0

for n in xrange(1,N+1):
  for k in xrange(1,n+1):
    if n-2*k >= 0:
#     s1, s2 = S[n-k:n], S[n-2*k:n-k]                                                                                                                          
#     if s1 == s2:                                                                                                                                             
      if maxmatch[n-2*k,n-k] >=k:
        # Here we are incrementing the count: C x_1...x_k -> C+1 x_1...x_k                                                                                     
        P[n,k] = min(P[n,k], P[n-k,k])
        print 'P[%d,%d] = %d' % (n,k,P[n,k])
    # Here we are starting a new block: 1 x_1...x_k                                                                                                            
    P[n,k] = min(P[n,k], Q[n-k] + 1 + k)
    print 'P[%d,%d] = %d' % (n,k,P[n,k])
  for k in xrange(1,n+1):
    Q[n] = min(Q[n], P[n,k])

  print

print Q[N]

You can reconstruct the actual encoding by remembering your choices along the way.

I have left out a minor wrinkle, which is that we might have to use an extra byte to hold C+1 if C is large. If you are using 32-bit ints, this will not come up in any context where this algorithm's runtime is feasible. If you are sometimes using shorter ints to save space then you will have to think about it, and maybe add another dimension to your table based on the size of the latest C. In theory, this might add a log(N) factor, but I don't think this will be apparent in practice.

Edit: For the benefit of @Moron, here is the same code with more print statements, so that you can more easily see what the algorithm is thinking:

import sys
import numpy as np

bignum = 10000

S = sys.argv[1] #'AAABBAAABBCECE'                                                                                                                              
N = len(S)

# length of longest substring match bet s[i:] and s[j:]                                                                                                        
maxmatch = np.zeros( (N+1,N+1), dtype=int)

for i in xrange(N-1,-1,-1):
  for j in xrange(i+1,N):
    if S[i] == S[j]:
      maxmatch[i,j] = maxmatch[i+1,j+1]+1

# P[n,k] = cost of encoding first n characters given that last k are a block                                                                                   
P = np.zeros( (N+1,N+1),dtype=int ) + bignum
# Q[n] = cost of encoding first n characters                                                                                                                   
Q = np.zeros(N+1, dtype=int) + bignum

# base case: no cost for empty string                                                                                                                          
P[0,0]=0
Q[0]=0

for n in xrange(1,N+1):
  for k in xrange(1,n+1):
    if n-2*k >= 0:
#     s1, s2 = S[n-k:n], S[n-2*k:n-k]                                                                                                                          
#     if s1 == s2:                                                                                                                                             
      if maxmatch[n-2*k,n-k] >=k:
        # Here we are incrementing the count: C x_1...x_k -> C+1 x_1...x_k                                                                                     
        P[n,k] = min(P[n,k], P[n-k,k])
        print "P[%d,%d] = %d\t I can encode first %d characters of S in only %d characters if I use my solution for P[%d,%d] with %s's count incremented" % (n\
,k,P[n,k],n,P[n-k,k],n-k,k,S[n-k:n])
    # Here we are starting a new block: 1 x_1...x_k                                                                                                            
    P[n,k] = min(P[n,k], Q[n-k] + 1 + k)
    print 'P[%d,%d] = %d\t I can encode first %d characters of S in only %d characters if I use my solution for Q[%d] with a new block 1%s' % (n,k,P[n,k],n,Q[\
n-k]+1+k,n-k,S[n-k:n])
  for k in xrange(1,n+1):
    Q[n] = min(Q[n], P[n,k])

  print
  print 'Q[%d] = %d\t I can encode first %d characters of S in only %d characters!' % (n,Q[n],n,Q[n])
  print


print Q[N]

The last few lines of its output on ABCDABCDABCDBCD are like so:

Q[13] = 7        I can encode first 13 characters of S in only 7 characters!

P[14,1] = 9      I can encode first 14 characters of S in only 9 characters if I use my solution for Q[13] with a new block 1C
P[14,2] = 8      I can encode first 14 characters of S in only 8 characters if I use my solution for Q[12] with a new block 1BC
P[14,3] = 13     I can encode first 14 characters of S in only 13 characters if I use my solution for Q[11] with a new block 1DBC
P[14,4] = 13     I can encode first 14 characters of S in only 13 characters if I use my solution for Q[10] with a new block 1CDBC
P[14,5] = 13     I can encode first 14 characters of S in only 13 characters if I use my solution for Q[9] with a new block 1BCDBC
P[14,6] = 12     I can encode first 14 characters of S in only 12 characters if I use my solution for Q[8] with a new block 1ABCDBC
P[14,7] = 16     I can encode first 14 characters of S in only 16 characters if I use my solution for Q[7] with a new block 1DABCDBC
P[14,8] = 16     I can encode first 14 characters of S in only 16 characters if I use my solution for Q[6] with a new block 1CDABCDBC
P[14,9] = 16     I can encode first 14 characters of S in only 16 characters if I use my solution for Q[5] with a new block 1BCDABCDBC
P[14,10] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[4] with a new block 1ABCDABCDBC
P[14,11] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[3] with a new block 1DABCDABCDBC
P[14,12] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[2] with a new block 1CDABCDABCDBC
P[14,13] = 16    I can encode first 14 characters of S in only 16 characters if I use my solution for Q[1] with a new block 1BCDABCDABCDBC
P[14,14] = 15    I can encode first 14 characters of S in only 15 characters if I use my solution for Q[0] with a new block 1ABCDABCDABCDBC

Q[14] = 8        I can encode first 14 characters of S in only 8 characters!

P[15,1] = 10     I can encode first 15 characters of S in only 10 characters if I use my solution for Q[14] with a new block 1D
P[15,2] = 10     I can encode first 15 characters of S in only 10 characters if I use my solution for Q[13] with a new block 1CD
P[15,3] = 11     I can encode first 15 characters of S in only 11 characters if I use my solution for P[12,3] with BCD's count incremented
P[15,3] = 9      I can encode first 15 characters of S in only 9 characters if I use my solution for Q[12] with a new block 1BCD
P[15,4] = 14     I can encode first 15 characters of S in only 14 characters if I use my solution for Q[11] with a new block 1DBCD
P[15,5] = 14     I can encode first 15 characters of S in only 14 characters if I use my solution for Q[10] with a new block 1CDBCD
P[15,6] = 14     I can encode first 15 characters of S in only 14 characters if I use my solution for Q[9] with a new block 1BCDBCD
P[15,7] = 13     I can encode first 15 characters of S in only 13 characters if I use my solution for Q[8] with a new block 1ABCDBCD
P[15,8] = 17     I can encode first 15 characters of S in only 17 characters if I use my solution for Q[7] with a new block 1DABCDBCD
P[15,9] = 17     I can encode first 15 characters of S in only 17 characters if I use my solution for Q[6] with a new block 1CDABCDBCD
P[15,10] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[5] with a new block 1BCDABCDBCD
P[15,11] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[4] with a new block 1ABCDABCDBCD
P[15,12] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[3] with a new block 1DABCDABCDBCD
P[15,13] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[2] with a new block 1CDABCDABCDBCD
P[15,14] = 17    I can encode first 15 characters of S in only 17 characters if I use my solution for Q[1] with a new block 1BCDABCDABCDBCD
P[15,15] = 16    I can encode first 15 characters of S in only 16 characters if I use my solution for Q[0] with a new block 1ABCDABCDABCDBCD

Q[15] = 9        I can encode first 15 characters of S in only 9 characters!
forefinger
Can you detail how you index the input string S and the other variables? when you say "to x" do you include x as well or only x - 1?Also, this: if S[n-k:n] == S[n-2*k:n-k] doesn't really look right to me. When n = 2 and k = 1, you test S[1:2] with S[0:1]. Shouldn't you be testing such that no characters overlap? I am getting some wrong answers implementing this like you say, for example on the string I posted in my OP, I get length 14, which is wrong. Even if I add/subtract 1 to avoid comparing overlapping characters, I get 10.
IVlad
I simplified things slightly and implemented it in python, and it gets the right answer on AAABBAAABBCECE. In Python, S[a:b] means characters a through b-1. xrange(a,b) similarly counts from a to b-1. This is what I was intending for the S[n-k:n] notation originally, but I agree that it's never really clear how to interpret ambiguities in pseudocode.
forefinger
@forefinger: Are you picking the largest possible prefix which can be written as a repeated block? How does your algorithm work on ABABCDEFBCDEFF? Can you write out your algorithm in english, please.
Moron
That works, thank you! I have a two observations however: first is that the algorithm is O(n^3), because comparing the substrings takes time as well. Second, how would you deal with inputs where the answer is 10ab for example? Your algorithm counts the 10 as one character. It would be nice to get this to O(n^2) somehow, maybe by using kmp to precompute the repeating substrings. Anyway, this is what I was looking for so I'm accepting your answer. If you have any more ideas for improving this please post :).
IVlad
1. It is cubic! Gah! I think I can get rid of it but I'm not sure. 2. This is the minor wrinkle I referred to. You might be able to deal with it by expanding the table like I said. How do you want to encode the numbers? (decimal notation seems like a strange choice)
forefinger
OK, now it's quadratic! I tested it using our favorite example.
forefinger
Can someone explain what the code is doing, please?
Moron
@Moron: I am not always picking the largest possible prefix, since this is not always correct, as your example shows. For your example, the relevant values of P from my algorithm are:P[1,1] = 2 (1A) ...P[2,2] = 3 (1AB) ...P[3,3] = 4 (1ABA) ...P[8,5] = 10 (1ABA1BCDEF) ...P[13,5] = 10 (1ABA2BCDEF) ...P[14,1] = 12 (1ABA2BCDEF1F)
forefinger
In English, I keep finding the best encoding of the first n characters that has the last k characters as a block. If the last k characters are the same as the k before that, I can just increment the count I had from before, like in the 1ABA1BCDEF -> 1ABA2BCDEF step. Otherwise, the last k characters are a new block, and I pay the cost for encoding them, like in the 1ABA -> 1ABA1BCDEF step.
forefinger
Actually, my previous comment should have had fewer values of P: P[3,3] = 4 (1ABA) ... P[8,5] = 10 (1ABA1BCDEF) ... P[13,5] = 10 (1ABA2BCDEF) ... P[14,1] = 12 (1ABA2BCDEF1F)
forefinger
Thanks, that's great :). Anyway I fixed 2. by using a log10 function to see if the new step adds an extra digit or not. Dunno if it's the best way, but it works, and doesn't use any additional memory and it's still pretty fast. Do you think there's a way to use only O(N) or at least less than O(N^2) memory? That would be really awesome :P.
IVlad
Seeing if the new step adds an extra digit is doable, but I was worried that it might change the optimal answer. If an answer was strictly optimal in my sense, it will probably be optimal in your sense also, since adding in a penalty for an extra digit will make it at most tied. But it seems possible that we could have two competing optimal solutions, one of which uses double digits and one which doesn't, in which case things do change.
forefinger
@forefingers: What about ABCDABCDABCDBCD. Would you put the last two BCD together? Sorry for asking too many questions, I am having a hard time trying to decipher the python code! btw, it is quite an interesting approach.
Moron
Sub-quadratic runtime and/or memory use seems unlikely. There are quadratically many substrings of our initial string, and I don't see any way to avoid remembering something about most of them, since basically any substring can be forced into the final output by constructing the right input.
forefinger
@Moron: try running the new code. Hopefully that will at least make the algorithm understandable, if not the exact Python code.
forefinger
I have finally understood the algorithm and am convinced that it is correct and runs in O(n^2) time. This is brilliant! I wish I could give it a +5.
Moron
@forefingers: I think you can publish this (if not already solved). At the very least, you can put it up on arxiv.
Moron
I agree the solution is very good and interesting, however there is definitely a way to reduce the memory used. Like I said in a previous comment, I've seen the problem in a comp sci contest where N was about max 5000 and the memory limit was about 1 megabyte. In fact it's also on a local online judge site, where people solved it using about ~200kb memory. I was only interested in the idea though, I'll try to improve on it myself :).
IVlad
@Mad: do you have a link to the contest problems?
Moron
@Moron: I don't have a link to the contest anymore, but if you understand romanian you can solve it on this online judge: http://infoarena.ro/problema/cod - the limit there is 0.3 seconds, N max 2000, 1MB memory limit. Read length of text from cod.in, then read actual text, then output length of encoded text in cod.out then output encoded text. The limits there are too tight for this solution to get accepted I think.
IVlad
@IVlad: This is driving me totally nuts. Please send me the answer if and when you figure it out.
forefinger
@Moron I don't think it's publishable. I unfortunately have a lot of experience with what isn't publishable.
forefinger
@IVlad: What does this mean: 1 ≤ N ≤ 2000. Pentru determinarea corecta a lungimii minime se acorda 40% din punctaj. 40% din teste vor avea 1 ≤ N ≤ 200.
forefinger
@forefinger: will do, you too :). One way I was thinking might help is to use the KMP algorithm, which could be used to more efficiently find repeating prefixes. I'm not sure how to use this though, since we'd still need the [k] dimension of the P matrix either way.Edit @ forefinger: it means that for each test you get 40% of the score if you only print the minimum length, and that 40% of the tests will have length lower than 200.
IVlad
I know this is usually bad practice, but since I see others are also very much interested in this problem, I am making an email of mine public in case someone is interested in sharing ideas without polluting the solution thread: [email protected] - if you use instant messengers drop me an email with your ID if you don't mind please.
IVlad
OK, you might be able to get rid of P. Instead of storing P and looking it up, do it forwards. Whenever we create a new block, we look ahead and see how many times it is repeated, using match[.,.]. We then move forward, noting down in Q all the costs this allows us to achieve. This looks like it makes the algorithm cubic, but I think it's amortized quadratic, as long as you only do this when you create a NEW block. (Specifically, I think you can charge the steps of the lookahead to the blocks of length k, and there are only quadratically many substrings.)
forefinger
Interesting, I'll try that. I think it is cubic though, in the worst case, isn't it? You have to try creating a new block at each step no?
IVlad
@forefinger: By publishing it I mean have it more easily accessible to web searchers etc. This current answer is a bit hard to digest if you don't know python.
Moron