views:

165

answers:

4

I'm trying to design a data structure that allows efficient extraction of entries from a part of their content.

Let's say I am looking for an entry that matches this: [ x 2 3 x x ]

If [ 0 2 3 4 5 ] or [ 3 2 3 7 8 ] are in my data structure, they should be returned by my find function.

I wrote such a data structure where I compare the "pattern" against all entries of the data structure, but of course it takes way too much time. I have few ideas about how to do this a faster way but they are quite heavy to implement. Does something like this already exists? If not, how would you do this?

+2  A: 

What you're trying to implement looks a lot like tuple space.

vartec
Yes it is actually what I want to do but I need I wanted a hint for implementing it. I guess I'll to look at the code! Thanks
Ben
Let me guess. Distributed programming class homework? ;-)
vartec
Nop ! that is real work :/
Ben
Oh, then why not use already available solutions?
vartec
Well I have specific needs but I simplified a bit the problem to ask my question. Available solutions are not exactly what I want, but they are a good start for implementation, I am on it.
Ben
+4  A: 

Well, a Suffix tree could do it in O(1), but it would take a LOT of memory.

Vilx-
+1. Assuming you have multiple entries, what you really want is called a generalised suffix tree, which is formed by concatenating all the entries together, using distinct (unique) separator symbols between each. Suffix arrays are a (theoretically) slower but much more memory-efficient alternative.
j_random_hacker
P.S.: Suffix trees and arrays rock, but they are *hard* to implement. If you can't find an existing implementation, give yourself a solid 2 weeks to try to understand Ukkonnen's or McCreight's original papers on linear-time construction algorithms.
j_random_hacker
How specifically does it do it in O(1)? You can retrieve all the entries containing a particular substring in O(length(substring)), if that's what you mean. I think the given problem does not restrict the known bits to be a single contiguous substring, either.
Darius Bacon
I think it does restrict it to a single contiguous substring, but yes, it's not very clear on this. As for O(1) - well, it depends on how you look at it. The Big O in this case depends only on the length of the substring, not the amount of entries in the dictionary.
Vilx-
Clasically when one talks about searching/sorting algorithms he measures the Big O in respect to the number of entries, not the length of them. From that point of view it's O(1).
Vilx-
+5  A: 

The first thing that comes to mind is to have a hash table for each position in the tuple. For searching you intersect the results for all the positions with a specified value.

starblue
To hash each item is a starting point, but for large answer sets the intersection takes time and consumes memory. Consider also to hash each subset of a specific window size.
dmeister
A: 

You might take a look at the RETE algorithm. This general problem came up in the AI systems of the 70s; Chapter 14 in Paradigms of AI Programming covers it. (IIRC there's mainly elaborations of starblue's answer and decision trees.)

Darius Bacon