views:

110

answers:

3

Hello,

I have an array of strings in C and an integer indicating how many strings are in the array.

char *strarray[MAX];  
int strcount;

In this array, the highest index (where 10 is higher than 0) is the most recent item added and the lowest index is the most distant item added. The order of items within the array matters.

I need a quick way to check the array for duplicates, remove all but the highest index duplicate, and collapse the array.

For example:

strarray[0] = "Line 1"; 
strarray[1] = "Line 2"; 
strarray[2] = "Line 3"; 
strarray[3] = "Line 2"; 
strarray[4] = "Line 4";

would become:

strarray[0] = "Line 1"; 
strarray[1] = "Line 3"; 
strarray[2] = "Line 2"; 
strarray[3] = "Line 4";

Index 1 of the original array was removed and indexes 2, 3, and 4 slid downwards to fill the gap.

I have one idea of how to do it. It is untested and I am currently attempting to code it but just from my faint understanding, I am sure this is a horrendous algorithm.

The algorithm presented below would be ran every time a new string is added to the strarray.

For the interest of showing that I am trying, I will include my proposed algorithm below:

  1. Search entire strarray for match to str
  2. If no match, do nothing
  3. If match found, put str in strarray
  4. Now we have a strarray with a max of 1 duplicate entry
  5. Add highest index strarray string to lowest index of temporary string array
  6. Continue downwards into strarray and check each element
  7. If duplicate found, skip it
  8. If not, add it to the next highest index of the temporary string array
  9. Reverse temporary string array and copy to strarray

Once again, this is untested (I am currently implementing it now). I just hope someone out there will have a much better solution.

The order of items is important and the code must utilize the C language (not C++). The lowest index duplicates should be removed and the single highest index kept.

Thank you!

+1  A: 

The typical efficient unique function is to:

  1. Sort the given array.
  2. Verify that consecutive runs of the same item are setup so that only one remains.

I believe you can use qsort in combination with strcmp to accomplish the first part; writing an efficient remove would be all on you though.

Unfortunately I don't have specific ideas here; this is kind of a grey area for me because I'm usually using C++, where this would be a simple:

std::vector<std::string> src;
std::sort(src.begin(), src.end());
src.remove(std::unique(src.begin(), src.end()), src.end);

I know you can't use C++, but the implementation should essentially be the same.

Because you need to save the original order, you can have something like:

typedef struct
{
    int originalPosition;
    char * string;
} tempUniqueEntry;

Do your first sort with respect to string, remove unique sets of elements on the sorted set, then resort with respect to originalPosition. This way you still get O(n lg n) performance, yet you don't lose the original order.

EDIT2: Simple C implementation example of std::unique:

tempUniqueEntry* unique ( tempUniqueEntry * first, tempUniqueEntry * last )
{
  tempUniqueEntry *result=first;
  while (++first != last)
  {
    if (strcmp(result->string,first->string))
      *(++result)=*first;
  }
  return ++result;
}
Billy ONeal
wouldn't sorting lose the order of the elements?
Jerry Smith
@Jerry: Answer edited.
Billy ONeal
Thank you for your edit! I am a little rusty on sorting but I can take it from here.I am going to try your idea and see how well it works.From what I understand, I need to iterate the strarray making a temp array of tempUniqueEntry in the process. Sort tempArray by string, remove duplicates, sort tempArray again by position, then reconstruct the strarray. Correct?
Jerry Smith
@Jerry: Yes, that's correct. You don't need to implement your own sorting algorithm; the standard library's `qsort` can do that. You'll just have to define comparison functions. As for remove, you may wish to follow `std::unique`, (like this one: http://www.cplusplus.com/reference/algorithm/unique/ ) because it can make the entire array unique in linear time -- no constant resizing of the array will be required as things are removed. (Just re implement std::unique yourself, replacing the predicate function with a function pointer returning equals, and the iterators with pointers)
Billy ONeal
A: 

Can you control the input as it is going into the array? If so, just do something like this:

int addToArray(const char * toadd, char * strarray[], int strcount)
{
    const int toaddlen = strlen(toadd);

    // Add new string to end.
    // Remember to add one for the \0 terminator.
    strarray[strcount] = malloc(sizeof(char) * (toaddlen + 1));
    strncpy(strarray[strcount], toadd, toaddlen + 1);

    // Search for a duplicate.
    // Note that we are cutting the new array short by one.
    for(int i = 0; i < strcount; ++i)
    {
        if (strncmp(strarray[i], toaddlen + 1) == 0)
        {
            // Found duplicate.
            // Remove it and compact.
            // Note use of new array size here.  
            free(strarray[i]);
            for(int k = i + 1; k < strcount + 1; ++k)
                strarray[i] = strarray[k];

            strarray[strcount] = null;
            return strcount;
        }
    }

    // No duplicate found.
    return (strcount + 1);
}

You can always use the above function looping over the elements of an existing array, building a new array without duplicates.

PS: If you are doing this type of operation a lot, you should move away from an array as your storage structure, and used a linked list instead. They are much more efficient for removing elements from a location other than the end.

jdmichal
This works well; it's better than the OP's original solution. +1. But unfortunately the performance is still order n-squared :(
Billy ONeal
As I understand your solution, if it is in strarray already it does nothing. If it is not, it adds it. If I am correct in my understanding, this will not work. I can control the input as it is entering the array but this method would not produce the result I gave in my post.I need the surviving duplicate to be in the highest, not the lowest, index. If toadd already exists in strarray[1] it would not be added to strarray[N] where N > 1
Jerry Smith
@Jerry Smith Your example is wrong then. It should read 1, 3, 2, 4. I'll correct my solution shortly... But that is a much more expensive operation, because it will require compacting the array each time.
jdmichal
@Jerry Smith Added removal and compacting. Please note my PS at the end.
jdmichal
@jdmichal you are right. My example was wrong. i have edited my post now to be correct. Sorry about that.I absolutely must have the array strings in their original order with the highest index kept and all lower index duplicates removed.However, I have the luxury of knowing my array will never have more than 10 elements in it so, in that regard, I am willing to go with a lower performing algorithm. I was just hoping something out there was better than my algorithm.
Jerry Smith
This is not C (see: new/delete)
Nick Presta
I am doing this operation somewhat frequently but this is a temporary project so I am not too concerned with finding efficient methods. Just wanted something a little easier to understand and implement that might be a little faster.
Jerry Smith
Cool. More down-votes for piddly crap. Corrected solution to use `malloc` and `free` instead of `new` and `delete`.
jdmichal
@jdmichal for what its worth, I wasn't looking for someone to give me full and complete tested code in C. I am smart enough to transform new/delete to malloc/free :) But apparently the stackoverflow community prefers answers to be complete when possible
Jerry Smith
@jdmichal: It's not 'piddly crap' - the code didn't even compile on a C compiler...
Nick Presta
@Jerry Smith Thank you for understanding that some code typed into a webform in a time-pressured scenario may not compile without minor modifications, and instead understanding that the idea presented is the important part.
jdmichal
A: 

I don't quite understand your proposed algorithm (I don't understand what it means to add a string to an index in step 5), but what I would do is:

unsigned int i;
for (i = n; i > 0; i--)
{
    unsigned int j;

    if (strarray[i - 1] == NULL)
    {
        continue;
    }

    for (j = i - 1; j > 0; j--)
    {
        if (strcmp(strarray[i - 1], strarray[j - 1]) == 0)
        {
            strarray[j - 1] = NULL;
        }
    }
}

Then you just need to filter the null pointers out of your array (which I'll leave as an exercise).

A different approach would be to iterate backwards over the array and to insert each item into a (balanced) binary search tree as you go. If the item is already in the binary search tree, flag the array item (such as setting the array element to NULL) and move on. When you've processed the entire array, filter out the flagged elements as before. This would have slightly more overhead and would consume more space, but its running time would be O(n log n) instead of O(n^2).

jamesdlin
What I meant in step 5 is simply:// where 0 is the lowest index and 9 is the largest index availabletemparray[0] = strarray[9];
Jerry Smith