tags:

views:

461

answers:

4

Please, now that I've re-written the question, and before it suffers from further fast-gun answers or premature closure by eager editors let me point out that this is not a duplicate of this question. I know how to remove duplicates from an array.

This question is about removing sequences from an array, not duplicates in the strict sense.

Consider this sequence of elements in an array;

[0] a
[1] a
[2] b
[3] c
[4] c
[5] a
[6] c
[7] d
[8] c
[9] d

In this example I want to obtain the following...

[0] a
[1] b
[2] c
[3] a
[4] c
[5] d

Notice that duplicate elements are retained but that sequences of the same element have been reduced to a single instance of that element.

Further, notice that when two lines repeat they should be reduced to one set (of two lines).

[0] c
[1] d
[2] c
[3] d

...reduces to...

[0] c
[1] d

I'm coding in C# but algorithms in any language appreciated.

+1  A: 

I would dump them all into your favorite Set implementation.

EDIT: Now that I understand the question, your original solution looks like the best way to do this. Just loop through the array once, keeping an array of flags to mark which elements to keep, plus a counter to keep track to the size of the new array. Then loop through again to copy all the keepers to a new array.

Bill the Lizard
nope, I don't want to remove all duplicates, just sequential duplicates.
Ed Guiness
I think this risks losing the order of the remaining lines...
dmckee
I upvoted this since your misunderstanding was probably due to my unclear original question.
Ed Guiness
Bill the Lizard
A: 

I agree that if you can just dump the strings into a Set, then that might be the easiest solution.

If you don't have access to a Set implementation for some reason, I would just sort the strings alphabetically and then go through once and remove the duplicates. How to sort them and remove duplicates from the list will depend on what language and environment you are running your code.

EDIT: Oh, ick.... I see based on your clarification that you expect that patterns might occur even over separate lines. My approach won't solve your problem. Sorry. Here is a question for you. If I had the following file.

a

a

b

c

c

a

a

b

c

c

Would you expect it to simplify to

a

b

c

Terry
To answer your question, I would expect it to result in a, b, c, a, b, c
Ed Guiness
PS: good question!
Ed Guiness
+2  A: 

EDIT: made some changes and new suggestions

What about a sliding window...

REMOVE LENGTH 2: (no other length has other matches)
//the lower case letters are the matches
ABCBAbabaBBCbcbcbVbvBCbcbcAB  
__ABCBABABABBCBCBCBVBVBCBCBCAB

REMOVE LENGTH 1 (duplicate characters):
//* denote that a string was removed to prevent continual contraction
//of the string, unless this is what you want.
ABCBA*BbC*V*BC*AB
_ABCBA*BBC*V*BC*AB

RESULT:
ABCBA*B*C*V*BC*AB == ABCBABCVBCAB

This is of course starting with length=2, increase it to L/2 and iterate down.

I'm also thinking of two other approaches:

  1. digraph - Set a stateful digraph with the data and iterate over it with the string, if a cycle is found you'll have a duplication. I'm not sure how easy it is check check for these cycles... possibly some dynamic programming, so it could be equivlent to method 2 below. I'm going to have to think about this one as well longer.
  2. distance matrix - using a levenstein distance matrix you might be able to detect duplication from diagonal movement (off the diagonal) with cost 0. This could indicate duplication of data. I will have to think about this more.
nlucaroni
I like this thought, thanks and ++.
Ed Guiness
+1  A: 

Here's C# app i wrote that solves this problem.

takes
aabccacdcd

outputs
abcacd

Probably looks pretty messy, took me a bit to get my head around the dynamic pattern length bit.

class Program
{
    private static List<string> values;
    private const int MAX_PATTERN_LENGTH = 4;

    static void Main(string[] args)
    {
        values = new List<string>();
        values.AddRange(new string[] { "a", "b", "c", "c", "a", "c", "d", "c", "d" });


        for (int i = MAX_PATTERN_LENGTH; i > 0; i--)
        {
            RemoveDuplicatesOfLength(i);
        }

        foreach (string s in values)
        {
            Console.WriteLine(s);
        }
    }

    private static void RemoveDuplicatesOfLength(int dupeLength)
    {
        for (int i = 0; i < values.Count; i++)
        {
            if (i + dupeLength > values.Count)
                break;

            if (i + dupeLength + dupeLength > values.Count)
                break;

            var patternA = values.GetRange(i, dupeLength);
            var patternB = values.GetRange(i + dupeLength, dupeLength);

            bool isPattern = ComparePatterns(patternA, patternB);

            if (isPattern)
            {
                values.RemoveRange(i, dupeLength);
            }
        }
    }

    private static bool ComparePatterns(List<string> pattern, List<string> candidate)
    {
        for (int i = 0; i < pattern.Count; i++)
        {
            if (pattern[i] != candidate[i])
                return false;
        }

        return true;
    }
}

fixed the initial values to match the questions values

sieben
Nice, I think this does it.
Ed Guiness