tags:

views:

608

answers:

6

I need to implement an in-memory tuple-of-strings matching feature in C. There will be large list of tuples associated with different actions and a high volume of events to be matched against the list.

List of tuples:

("one", "four")
("one")
("three")
("four", "five")
("six")

event ("one", "two", "three", "four") should match list item ("one", "four") and ("one") and ("three") but not ("four", "five") and not ("six")

my current approach uses a map of all tuple field values as keys for lists of each tuple using that value. there is a lot of redundant hashing and list insertion.

is there a right or classic way to do this?

+3  A: 

I have no out of the box answer to this, but I like the question so much that I bump you up with a vote.

Btw, if you don't have that much strings in your tuples you can turn them into an index and use a bitmap-search to find all matches. However, this is not practical if you have to deal with more than a couple of dozen different strings.

Nils Pipenbrinck
+1  A: 

I don't know of any classical or right way to do this, so here is what I would do :P

It looks like you want to decide if A is a superset of B, using set theory jargon. One way you can do it is to sort A and B, and do a merge sort-esque operation on A and B, in that you try to find where in A a value in B goes. Those elements of B which are also in A, will have duplicates, and the other elements won't. Because both A and B are sorted, this shouldn't be too horrible.

For example, you take the first value of B, and walk A until you find its duplicate in A. Then you take the second value of B, and start walking A from where you left off previously. If you get to end of A without finding a match, then A is not a superset of B, and you return false.

If these tuples can stay sorted, then the sorting cost is only incurred once.

freespace
+2  A: 

If you only have a small number of possible tuple values it would make sense to write some sort of hashing function which could turn them into integer indexes for quick searching.

If there are < 32 values you could do something with bitmasks:

unsigned int hash(char *value){...}

typedef struct _tuple {
    unsigned int bitvalues;
    void * data
} tuple;

tuple a,b,c,d;
a.bitvalues  = hash("one");
a.bitvalues |= hash("four");
//a.data = something;

unsigned int event = 0;
//foreach value in event;
event |= hash(string_val);

// foreach tuple
if(x->bitvalues & test == test)
{
     //matches
}

If there are too many values to do a bitmask solution you could have an array of linked lists. Go through each item in the event. If the item matches key_one, walk through the tuples with that first key and check the event for the second key:

typedef struct _tuple {
    unsigned int key_one;
    unsigned int key_two;
    _tuple *next;
    void * data;
} tuple;

tuple a,b,c,d;
a.key_one = hash("one");
a.key_two = hash("four");

tuple * list = malloc(/*big enough for all hash indexes*/
memset(/*clear list*/);

//foreach touple item
if(list[item->key_one])
   put item on the end of the list;
else
   list[item->key_one] = item;


//foreach event
   //foreach key
      if(item_ptr = list[key])
        while(item_ptr.next)
           if(!item_ptr.key_two || /*item has key_two*/)
              //match
           item_ptr = item_ptr.next;

This code is in no way tested and probably has many small errors but you should get the idea. (one error that was corrected was the test condition for tuple match)


If event processing speed is of utmost importance it would make sense to iterate through all of your constructed tuples, count the number of occurrences and go through possibly re-ordering the key one/key two of each tuple so the most unique value is listed first.

Frosty
thx, too many for bitmask but the 2nd solution, list of key_one(s), fixes the big problem i had with my own, that i was testing some tuples multiple times against the same event.
navicore
since my main concern is to limit the number of tuples i test against an event, i'm going to implement a variation of this 2nd approach. the variation would be that i'd like key_one to be the most unique part of the tuple. i'll test if the overhead of calculating this helps or hurts. thx.
navicore
A: 

If you have a smallish number of possible strings, you can assign an index to each and use bitmaps. That way a simple bitwise and will tell you if there's overlap.

If that's not practical, your inverted index setup is probably going to be hard to match for speed, especially if you only have to build it once. (does the list of tuples change at runtime?)

Nick Johnson
thx. yes, the list is modified at runtime. the possible strings are not constrained.
navicore
A: 
 public static void Main()
 {
  List<List<string>> tuples = new List<List<string>>();

  string [] tuple = {"one", "four"};
  tuples.Add(new List<string>(tuple));

  tuple = new string [] {"one"};
  tuples.Add(new List<string>(tuple));

  tuple = new string [] {"three"};
  tuples.Add(new List<string>(tuple));

  tuple = new string[]{"four", "five"};
  tuples.Add(new List<string>(tuple));

  tuple = new string[]{"six"};
  tuples.Add(new List<string>(tuple));

  tuple = new string[] {"one", "two", "three", "four"};

  List<string> checkTuple = new List<string>(tuple);

  List<List<string>> result = new List<List<string>>();

  foreach (List<string> ls in tuples)
  {
   bool ok = true;
   foreach(string s in ls)
    if(!checkTuple.Contains(s))
    {
     ok = false;
     break;
    }
   if (ok)
    result.Add(ls);
  }
 }
Jason Lepack
The question was for a C solution. Not C++.
Frosty
+2  A: 

A possible solution would be to assign a unique prime number to each of the words.

Then if you multiply the words together in each tuple, then you have a number that represents the words in the list.

Divide one list by another, and if you get an integer remainder, then the one list is contained in the other.

EvilTeach