views:

278

answers:

4

My main problem is trying to find a suitable solution to automatically turning this, for example:

d+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+

into this:

[d+c+d+f+]4

i.e. Finding duplicates next to each other, then making a shorter "loop" out of these duplicates. So far I have found no suitable solution to this, and I look forward to a response. P.S. To avoid confusion, the aforementioned sample is not the only thing that needs "looping", it differs from file to file. Oh, and this is intended for a C++ or C# program, either is fine, though I'm open to any other suggestions as well. Also, the main idea is that all the work would be done by the program itself, no user input except for the file itself. Here is the full file, for reference, my apologies for the stretched page: #0 @16 v225 y10 w250 t76

l16 $ED $EF $A9 p20,20 >ecegb>d<bgbgecgec<g >d+<b>d+f+a+>c+<a+f+a+f+d+<b>f+d+<bf+ >c<a>cegbgegec<a>ec<ae > d+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+ r1^1

/ l8 r1r1r1r1 f+<a+>f+g+cg+r4 a+c+a+g+cg+r4f+<a+>f+g+cg+r4 a+c+a+g+cg+r4f+<a+>f+g+cg+r4 a+c+a+g+cg+r4 f+<a+>f+g+cg+r4 a+c+a+g+r4g+16f16c+ a+2^g+f+g+4 f+ff+4fd+f4 d+c+d+4c+c<a+2^4 >c4d+ <g+2^4r4^ a+>c+d+4g+4a+4 r1^2^4^a+2^g+f+g+4 f+ff+4fd+f4 d+c+d+4c+c<a+2^4 >c4d+ <g+2^4r4^ a+>c+d+4g+4a+4 r1^2^4^ r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1

#4 @22 v250 y10

l8 o3 rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+rg+ / r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1

#2 @4 v155 y10

l8 $ED $F8 $8F o4 r1r1r1 d+4f4f+4g+4 a+4r1^4^2 / d+4^fr2 f+4^fr2d+4^fr2 f+4^fr2d+4^fr2 f+4^fr2d+4^fr2 f+4^fr2 > d+4^fr2 f+4^fr2d+4^fr2 f+4^fr2 < f+4^g+r2 f+4^fr2f+4^g+r2 f+4^fr2f+4^g+r2 f+4^fr2f+4^g+r2 f+4^fr2f+4^g+r2 f+4^fr2f+4^g+r2 f+4^fr2f+4^g+r2 f+4^fr2f+4^g+r2 f+4^fr2 > a+4^g+r2 f+1a+4^g+r2 f+1 f+4^fr2 d+1 f+4^fr2 d+2^d+4^ r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1

#3 @10 v210 y10

r1^1 o3 c8r8d8r8 c8r8c8r8c8r8c8r8c8r8c8r8c8r8c8r8c8r8c8r8c8r8 c8 @10d16d16@21 c8 @10d16d16@21 c8 @10d16d16@21 / c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8 c4@10d8@21c8<b8> @10d16d16d16d16d16r16 c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8c4@10d8@21c8<b8>c8@10d8@21c8 c4@10d8@21c8 @10b16b16>c16c16<b16b16a16a16

#7 @16 v230 y10

l16 $ED $EF $A9 cceeggbbggeeccee <bb>d+d+f+f+a+a+f+f+d+d+<bb>d+d+ <aa>cceeggeecc<aa>cc <g+g+bb>d+d+ffd+d+<bbg+g+bb / r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1

#5 @4 v155 y10

l8 $ED $F8 $8F o4 r1r1r1r1 d+4r1^2^4 / <a+4^>cr2 c+4^cr2<a+4^>cr2 c+4^cr2<a+4^>cr2 c+4^cr2<a+4^>cr2 c+4^cr2 a+4^>cr2 c+4^cr2 <a+4^>cr2 c+4^c r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1 r2 f+4^fr2 d+1f+4^fr2 d+1 c+4^cr2 <a+1 >c+4^cr2 <a+2^a+4^ r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1r1

+2  A: 

You can use the Smith-Waterman algorithm to do local alignment, comparing the string against itself.

http://en.wikipedia.org/wiki/Smith-Waterman_algorithm

EDIT: To adapt the algorithm for self alignment, you need to force values in the diagonal to zero - that is, penalize the trivial solution of aligning the whole string exactly with itself. Then the "second best" alignment will pop out instead. This will be the longest two matching substrings. Repeat the same sort of thing to find progressively shorter matching substrings.

Kevin Beck
A: 

Why not just use System.IO.Compression?

Mike Curry
Mainly because "The strings need to be longer than 3-400 characters" kind of makes this method unwanted.
Onion
+1  A: 

Not sure if this is what you are looking for.

I took the string "testtesttesttest4notaduped+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+testtesttest" and converted it to "[test]4 4notadupe[d+c+d+f+]4 [test]3 "

I'm sure someone will come up with a better more efficient solution as it's a bit slow when processing your full file. I look forward to other answers.

        string stringValue = "testtesttesttest4notaduped+c+d+f+d+c+d+f+d+c+d+f+d+c+d+f+testtesttest";

        for(int i = 0; i < stringValue.Length; i++)
        {
            for (int k = 1; (k*2) + i <= stringValue.Length; k++)
            {
                int count = 1;

                string compare1 = stringValue.Substring(i,k);
                string compare2 = stringValue.Substring(i + k, k);

                //Count if and how many duplicates
                while (compare1 == compare2) 
                {
                    count++;
                    k += compare1.Length;
                    if (i + k + compare1.Length > stringValue.Length)
                        break;

                    compare2 = stringValue.Substring(i + k, compare1.Length);
                } 

                if (count > 1)
                {
                    //New code.  Added a space to the end to avoid [test]4 
                    //turning using an invalid number ie: [test]44.
                    string addString = "[" + compare1 + "]" + count + " ";

                    //Only add code if we are saving space
                    if (addString.Length < compare1.Length * count)
                    {
                        stringValue = stringValue.Remove(i, count * compare1.Length);
                        stringValue = stringValue.Insert(i, addString);
                        i = i + addString.Length - 1;
                    }
                    break;
                }
            }
        }
Chris Persichetti
Thanks, this works perfectly. I'm not sure what you meant when you said it was slow for the whole file, as it seems quite quick to me. (1-3 seconds at most) But thank you very much, regardless.
Onion
+1  A: 

LZW can help: it uses prefixes dictionary to search for repetitive patterns and replaces such data with references to previous entries. I think it should not be hard to adapt it for your needs.

Rageous
Interactive LZW compression applet: http://www.cs.sfu.ca/CC/365/li/squeeze/LZW.html
Rageous