views:

330

answers:

5

I've written a program to perform run length encoding. In typical scenario if the text is

AAAAAABBCDEEEEGGHJ

run length encoding will make it

A6B2C1D1E4G2H1J1

but it was adding extra 1 for each non repeating character. Since i'm compressing BMP files with it, i went with an idea of placing a marker "$" to signify the occurance of a repeating character, (assuming that image files have huge amount of repeating text).

So it'd look like

$A6$B2CD$E4$G2HJ

For the current example it's length is the same, but there's a noticable difference for BMP files. Now my problem is in decoding. It so happens some BMP Files have the pattern $<char><num> i.e. $I9 in the original file, so in the compressed file also i'd contain the same text. $I9, however upon decoding it'd treat it as a repeating I which repeats 9 times! So it produces wrong output. What i want to know is which symbol can i use to mark the start of a repeating character (run) so that it doesn't conflict with the original source.

+5  A: 

Why don't you encode each $ in the original file as $$ in the compressed file?

And/or use some other character instead of $ - one that is not used much in bmp files.

Also note that the BMP format has RLE compression 'built-in' - look here, near the bottom of the page - under "Image Data and Compression".

I don't know what you're using your program for, or if it's just for learning, but if you used the "official" bmp method, your compressed images wouldn't need decompression before viewing.

Blorgbeard
+1 for answering the question and giving an alternative
Mark Pim
A: 

If I understand correctly, the problem is that $ is both a symbol for marking a repeat, and also can be a 'BMP' value as well?

If so, what you could do is to mark a double $ ('$$') character to denote that the '$' character should be treated not as a repeat, but as a single '$'. This would of course mean that the '$' is expensive to encode (takes two symbols instead of 1), but would solve your problem.

If you wanted to have a run of the '$' character, you would need to encode it as: $$$5 - meaning '$' run of '$$'=$, '5' - 5 times.

Andy Wyatt
+2  A: 

AAAAAABBCDEEEEGGHJ$IIIIIIIII ==> $A6$B2CD$E4$G2HJ$$I9

If the repeat character occurs in the data, try inserting an extra repeat character in the encoded data. Then if the decoder sees a double repeat character it can insert the actual repeat character

$A6$B2CD$E4$G2HJ$$I9 ==> AAAAAABBCDEEEEGGHJ$IIIIIIIII

mrwes
+1  A: 

What most programs do to signify that some character needs to be treated literally is that they have a defined escape sequence.

For example, in regular expressions, the following are specially defined characters that usually have a meaning:

^[].*+{}()$

Yes, your fun dollar sign character is in there, and it usually means end of line.

So what a programmer using regular expressions has to do to have these characters interpreted literally is that they need to express those characters as an escape sequence. For example, to interpret $ as $, and not end of line, the programmer uses \$, which is the escape sequence.(1)

In your case, you can store literal dollar signs into your compressed file as \$.(2)

  1. NB: grep inverts this logic.

  2. The above solutions to store $ as $$ becomes confusing when you have runs of $ in the BMP file.

supercheetah
+1  A: 

If you have the luxury of being able to scan the entire input before starting to compress it, you could choose the least frequent value in the input as your escape value. For example, given this input:

AAAABBCCCCDDEEEEEEEFFG

You could choose "G" as your escape value (or even "H" if it's part of your symbol set) and adopt a convention whereby the first character of the encoded stream is the escape value. So the string above might encode to:

GGA4BBGC4DDGE7FFGG

or even better:

HHA4BBHC4DDHE7FFG

Please note that there's no point in encoding a "run" of two identical values because the "compressed" version (e.g. HD2) is longer than the uncompressed version (DD).

Hope that helps!

Anodyne