views:

185

answers:

2

I have a list with items where each have number of properties (A, B, C, D) which I would like to filter using template containing same attributes (A, B, C, D). When I use a template I would like to filter all items matching this template. The match is assumed if item is equal to template or is smaller subsequence of it (0 match any item).

Example data

A B C D
1 0 1 0
2 0 0 0
0 0 2 3
2 0 2 1
2 0 2 0
0 0 0 0

Example templates

[2 0 0 0] will filter {[0 0 0 0], [2 0 0 0]}
[2 0 2 0] will filter {[0 0 0 0], [2 0 0 0], [2 0 2 0]}
[2 0 2 1] will filter {[0 0 0 0], [2 0 2 1]}
[3 4 5 6] will filter {[0 0 0 0]}
[0 0 2 0] will filter {[0 0 0 0], [0 0 2 3], [2 0 2 1], [2 0 2 0]}

The problem is that number of comparisons can easily reach 300k and can get slow sometimes. What tricks or structure I could use to make things quicker? Any ideas?

+1  A: 

What about nested hash maps? For example, an item "it" will be stored as:

map(it.A)(it.B)(it.C).(it.D) = it

So [2 0 2 0] could be searched as:

map(2).keys.(2).keys
Daniel
What about 0? Item property with zero value is matched by any value in template. Should I just put my item in 2+ places: map(A)+map(0), map(B)+map(0) and so on?
Sergej Andrejev
+2  A: 

Assuming 4 properties, let's place all the items into 16 buckets.

First bucket is where there are no zero-values for the properties. Selecting from here - simple lookup based on key ABCD.

Second bucket is where the property A == 0. Selecting from here is a lookup on the template with value of BCD.

Third bucket is where B == 0. Selecting from here is a lookup on the template with value of ACD.

Fourth is where A == 0 and B == 0. Selecting from here is a lookup on the template with value of CD.

....

Fifteenth is where A,B,C == 0. the lookup is on D.

Sixteenth is where A,B,C,D == 0. This can be a boolean variable ;-)

Since all of the 16 buckets are 'exact match' - you can use methods like hash tables for the search inside them.

(this proposal is based on the assumption from the example that it's 0 in the prop value that counts as 'match any' and not in the template.) - because the 2000 selected only one value in your exaample. it will obviously be incorrect if the semantics is 'any' in both places.

--

update: corollary: you can have no more than 2^Nproperties matches.

Example:

Let's suppose we have 3 properties A,B,C and the following four items:

itemX[A=1, B=0, C=1] ---> B is a wildcard, so bucketAC[11] = itemX
itemY[A=2, B=0, C=0] ---> B and C are wildcards, so bucketA[2] = itemY
itemZ[A=2, B=1, C=0] ---> C is a wildcard, so bucketAB[21] = itemZ

now, the lookup for a key 'abc' would be as follows (I also include to the right the contents of the buckets for ease of reading, and '<<' means 'accumulate' in this context)

1.results << bucketA[a]       | '2' => itemY[A=2, B=0, C=0]
2.results << bucketB[b]       
3.results << bucketAB[ab]     | '21' => itemZ[A=2, B=1, C=0]
4.results << bucketC[c]
5.results << bucketAC[ac]     | '11' => itemX[A=1, B=0, C=1]
6.results << bucketBC[bc]
7.results << bucketABC[abc]
8.results << bucket_item_all_wildcards

So if we use template [2 0 0], we get the results from key being A=2 in bucketA only. If we use template [2 1 0], then we get the results from key being A=2 in bucketA, and from key being AB=21 in bucketAB - two results.

NB: Of course, the above notation for keys is rather frivolous, it merely assumes "hashtable-like access with the concatenation of the said properties being the key".

If you are allowed to have items with the same properties multiple times, then you will need to have multiple elements in some slots - and then, obviously, you can have more than 2^Nproperties search results, nonetheless you can track the maximum number of duplicates and hence always predict the worst-case maximum number of items.

Notably, if the number of properties grows, the total possible number of buckets will quickly blow up (e.g. 32 properties would mean maximum more than 4 billion buckets), so this idea will no longer be applicable directly, and would need further optimizations around the bucket traversal/allocation.

Andrew Y
Why 16? I counted 14 (A, B, C, D, AB, AC, AD, BC, BD, ABC, ABD, ACD, BCD, ABCD). Anyway, this is usefull idea. Combinding it with what Daniel suggested will surely give some results
Sergej Andrejev
I was to quick to answer. Actually if you think about it. Several buckets can contain same items. I will lose same time (if not more) to eleminate duplicates when merging hash tables
Sergej Andrejev
every position is either zero or not. I used 16 for clarity (a 16th where all prop values are zero either has one member or none :) and in your list you are missing DC, I think.
Andrew Y
nope they should not - because each property is either zero or not - it can not be both zero and nonzero.
Andrew Y
You are right again. Yes, item can be in multiple buckets but using your way will filter just one of them.
Sergej Andrejev
it would not be - because if you have a wildcard in a position, you only put it into one of the 'wildcarded' buckets, not into 'exact' bucket - and it would be only one of them. And since those wildcarded buckets do not have the wildcard property as a key - the lookups there are always exact. So, colloquially it seems like you are never going to have more than 16 matches in this case for any value of template.
Andrew Y
Can you imagine example with three properties (A, B and C). Structure contains just [1 0 0] and therefor added to buckets B, C and BC. When user searches using template [1 1 0] he will look in bucket C. Being in bucket you once again have to do a loop search because hash search won't give results because [1 0 0] have different hash from [1 1 0]. Another option is to search by hash but in several matching buckets A, B and AB. Did I get your idea wrong?
Sergej Andrejev
Rather than having a long comment, I've updated the example with the explanation.
Andrew Y