tags:

views:

322

answers:

3

I was looking at a recent Code Golf on the removal of duplicate characters in a string. i mulled it over and thought that the RLE algorithm would solve it, in fact, I did believe that would resolve removing duplicates, I wrote an implementation here in C, to see how far I could go with it

char *rle(const char *src){
    char *p=(char *)src;
    char *q=(char *)src+1;
    char *rle_enc=NULL, *tmp_rle, buf[10];
    int run=1;
    while (*p){
        while(*q){
            if (*p==*q++) run++,p++;
        }
        sprintf(buf,"%d%c",run,*(p-1));
        p++;
        if (!rle_enc){
            if ((rle_enc=malloc(strlen(buf)+1))!=NULL){
                strcpy(rle_enc,buf);
            }
        }else{
            if ((tmp_rle=realloc(rle_enc,(strlen(rle_enc)+strlen(buf)+1)))!=NULL){
                rle_enc=tmp_rle;
                strcat(rle_enc,buf);
            }
        }
        q=(p+1);
        run=1;
    }
    return rle_enc;
}

Sure enough, here's the main for this:

int main(int argc, char **argv){
    char *test1 = "HHHHHHeeeeeelllllloooooooo";
    char *test2 = "nbHHkRvrXbvkn";
    char *p = rle(test1);
    printf("s = %s\n", test1);
    printf("p = %s\n", p);
    if (p) free(p);
    return 0;
}

According to Code Golf on meta, it should be reusable and solve a set of problems, BUT in the shortest set of characters, fair enough I thought I'd just change the variables to 1 letters and compact the code to make it small..but something wasn't quite right with it as this lead me to think about the RLE algorithm itself, here's a page on Wikipedia about what it has to say and the implementation in Java.

The code does appear to be doing what it should, so I thought, now, it's just a matter of going through the encoded string result from rle looking for those that have a 1 followed by the letter..

I did however notice the limitation of the RLE algorithm, it is only suitable for those that have a set of repetitive characters adjacent to each other. But it failed the test case for the Code Golf, which looks deceptively simple, which leads me to this question:

Is the RLE algorithm flawed? Where would it be used nowadays? Gathering dust I presume due to the volume of data and information flowing around that RLE no longer fits a purpose...

Edit: Thanks to Moonshadow, John and Steve for posting their answers.

There is a fundamental lesson that I've still failed to learn - never ever go OTT and think complex when it comes to this kind of thing, that is a fallacy on my part and shows how big thinking can get in the way and I can get sucked into it too deeply and get carried away without looking at the right angle!!!!! Thanks again! :)

Best regards, Tom.

+2  A: 

RLE was commonly used for 8-bit bitmaps, as they would commonly have long runs of the same character. Windows still supports a RLE video codec that was used in a similar fashion. Nowadays, LZW + Huffman encoding has superceded RLE as the "simple" compression algorithm.

RLE has been used for years, so it us pretty hard to say that it is "flawed", but it certainly isn't efficient.

Most RLE formats will have an "escape character", so that there can be no confusion as to the output.

For example, if we use "e" as an escape character...

This would produce a litteral "e":

ee

This would be the letter "a" repeated twice:

ea2
John Gietzen
@John: Thanks but it still doesnt answer my question about it being flawed...
tommieb75
@tommieb75: What do you mean by flawed? It does what it is supposed to do. It is usable to compress data with continuous blocks of same value, e.g. low-color images with large constant color areas. However, it does not produce enough information (about duplicate but non-adjacent characters) for solving this golf-problem.
Anssi
+4  A: 

RLE will not solve that code golf problem for you.

The code golf problem requires you to strip out all characters that occur more than once in the input, regardless of where the occurrences are. However, RLE, "run length encoding", encodes "runs" - repeated sequences of the same character; multiple runs of the same character can occur in a string, and RLE will encode these separately, by design.

RLE is intended to encode sequences of repeated data elements more compactly by replacing the sequence with just one element followed by the number of times it is repeated. For this purpose it is perfectly adequate. Any "flaw" is not in the algorithm, but rather in the decision to use it for a purpose for which it is poorly suited.

moonshadow
@Moonshadow: Thanks! I did realize after banging my head yesterday on this, that my thinking was not suitable for the purpose!!! :) You get an accepted answer and it shows to be careful not to go OTT and think complex...
tommieb75
+1  A: 

Why do you think it is flawed? RLE works for compressing repeated characters. It isn't intended to do anything else and won't help compress data without run lengths > 1.

In the context of this problem I would say that RLE was just not the right answer, it is not flawed, it is just not helpful in this case.

Steve Haigh
@Steve: Thanks! See my comment below to moonshadow! :)
tommieb75