I have a model which looks like this:

class Search(db.Model):
    word = db.StringProperty()

An example "word" could be word = "thisisaword".

I want to search all Search entities for substrings like "this", "isa", etc.

How can I do this in App Engine using Python?

Update:

The words here will be domain names, so I guess there is some limit on the size of a word, e.g. google.com, facebook.com.

I want to display "google.com" when someone searches for "gle".

+5  A: 

With no word separation, I don't think the task you describe is feasible (it would be in many DB engines, by implicitly using no index at all and destroying performance and scalability, but App Engine just doesn't implement "features" that inevitably destroy scalability and performance). If you had word separation, you could use the rudimentary full-text search explained here, but, as that blog says,

it has no exact phrase match, substring match, boolean operators, stemming, or other common full-text features.

nonrel-search is mentioned here as an "alternative" to, and "similar" to, an older, now-discontinued project by the same authors called gae-search (which was available in free and for-pay versions; maybe the for-pay version could help you, though I've never investigated it in depth). The last link I gave has contact info for the authors, who might be happy to develop something for you if your budget is sufficient to fund a costly project like this.

The problem is that the number of substrings of a given string grows quadratically with the string's length, so the indices needed for a reasonably fast search of the unbounded kind you want would grow terribly big extremely fast. You could apply some optimizations if you have bounded lengths for the strings you're storing and those you're searching for, but it's still a pretty hard problem to solve with any halfway tolerable efficiency.
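For a concrete sense of that growth, a string of length n has n*(n+1)/2 non-empty substrings; a tiny illustrative computation:

def substring_count(n):
    # Substrings starting at position i can have length 1 .. n-i,
    # so the total is n + (n-1) + ... + 1 = n*(n+1)/2.
    return n * (n + 1) // 2

for n in (8, 16, 32):
    print n, substring_count(n)   # 8 -> 36, 16 -> 136, 32 -> 528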

Maybe you can explain exactly what you're trying to achieve with this hypothetical "arbitrary substring search", so that the numerical limits on the strings (and the substrings being searched for), in both lengths and numbers, can be assessed. As you've currently expressed your problem there seem to be no limits whatsoever (hopefully that's not really the case!). If your numerical limits are not tight, the exact problem you want to solve might not be practically solvable, but some variant or subset of it might be; you'll need to state the exact problem to let would-be helpers think about such subsets and variants!

Edit: given the slight clarification in the OP's edit of his Q, the heuristic I would suggest is to pick some sensible minimum and maximum "relevant substring length" (say 2 and 5; call them MINRSL and MAXRSL for definiteness). When a string (domain name) is entered, break it up at the dots if appropriate (e.g. if you don't want to allow searches "across" the dots), possibly discard some parts (you probably don't want to explicitly index all the .com, .org, etc. suffixes; in any case, this decision is app-specific), and, for each of the remaining parts that you do want to be searchable, index its substrings of lengths MINRSL through MAXRSL.
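A tiny sketch of that pre-processing step (the stop-part list here is purely illustrative; which parts to discard is, as noted, app-specific):

STOP_PARTS = set(['www', 'com', 'org', 'net'])  # illustrative only, app-specific

def searchable_parts(domain_name):
    # Break the domain name at the dots and drop the parts not worth indexing.
    return [p for p in domain_name.lower().split('.')
            if p and p not in STOP_PARTS]

print searchable_parts('www.google.com')   # ['google']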

Specifically, with the limits 2 and 5 given above, and assuming www. and .com can be removed (much as one usually removes words like "and", "the", "of", ... in full-text search: such "stop words" are just too common, and a search for them, in return for the enormous cost of indexing them, would return tons of useless, unrelated documents), you would have to consider as indexables:

go oo og gl le
goo oog ogl gle
goog oogl ogle
googl oogle

so you'd need to create 5 + 4 + 3 + 2 = 14 instances of an auxiliary model which has the indexable as one field and, as the other field, a reference to the instance where you stored www.google.com. Like all indexing schemes, of course, this makes "writes" onerous (creating a new object, or, even worse, altering the indexed parts of existing ones!) as the price you pay for very fast "reads" (searches).
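Here is a minimal sketch of such an auxiliary model and its write-time indexing, reusing the Search model from the question; the names SubstringIndex and index_part, and the default limits, are illustrative, not an established API:

from google.appengine.ext import db

class SubstringIndex(db.Model):
    indexable = db.StringProperty()         # the indexed substring
    target = db.ReferenceProperty(Search)   # back-reference to the indexed entity

def index_part(search_entity, part, minrsl=2, maxrsl=5):
    # search_entity must already have been put(), so it has a complete key.
    # One SubstringIndex per substring of part with length in [minrsl, maxrsl]:
    # for "google" with limits 2 and 5, that's the 5 + 4 + 3 + 2 = 14 entries above.
    for length in range(minrsl, maxrsl + 1):
        for start in range(len(part) - length + 1):
            SubstringIndex(indexable=part[start:start + length],
                           target=search_entity).put()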

Alternatively, for cheaper writing but costlier reading (searching), you could record only substrings of a single fixed length, say 4 -- that would be just (an idealized, oversimplified case; see below):

goog oogl ogle

i.e., three instances of the auxiliary model instead of fourteen. But now, for searching, you need to truncate the substring being searched for to four characters, fetch all the matches (which will include some false positives), and use extra code in your application to filter the "possible hits" thus found, eliminating the false positives.
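A sketch of that truncate-then-filter search, against the hypothetical SubstringIndex model above (assuming a fixed index length of 4):

def search_fixed_length(term, index_len=4):
    candidates = SubstringIndex.all().filter('indexable =', term[:index_len])
    # Weed out the false positives: keep only entities whose word really
    # contains the whole original search term.
    return [idx.target for idx in candidates if term in idx.target.word]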

When the user searches for a shorter string, say just "oo", you can locate all the matches that start with "oo" by using both a >= and a < in your query: >= "oo", but also < "op" (the next possible length-two string). However (and this is the oversimplification in the paragraphs above), this doesn't work for shorter-substring searches whose match does not appear at the start of a substring of length four, so you have to add the "trailing indexables"

gle le

(for a total of 5, rather than 14 with full indexing) to this more complicated but more balanced scheme.
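A sketch of such a prefix range query, again on the hypothetical SubstringIndex model (computing the upper bound by bumping the last character, as in the "oo" / "op" example; a real implementation would also have to handle the edge case where the last character is already the maximum one):

def prefix_search(prefix):
    # Smallest string sorting after everything that starts with prefix,
    # e.g. "oo" -> "op" (a maximal last character is ignored here).
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    q = SubstringIndex.all()
    q.filter('indexable >=', prefix)
    q.filter('indexable <', upper)
    return [idx.target for idx in q]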

Note that, in the complete-indexing scheme too, you still need code to eliminate false positives when needed: if you've set MAXRSL to 5, and the user looks for a substring of length, say, seven, you either give an error, or truncate it to five and apply the same filtering code I mentioned above.

Can you afford the simpler, faster-for-searching "complete indexing from MINRSL to MAXRSL" architecture? It depends on your numbers. If you have, in all, about 2,000 indexed URLs with a total of, say, 4,000 "words" in them to index, all of which we'll say (for simplicity) are 8 characters long, then the MINRSL=2, MAXRSL=5 scheme requires 7+6+5+4 indexables per word, i.e., 22 per word; multiplied by 4,000 that is only 88,000 entries, which would be quite affordable. But if you have many more words to index, or substantially longer words, or need a much vaster range from min to max RSL, then the numbers can grow worrisome (and that is the case in which the savings of, say, a factor of three with the more complicated, slower-to-search scheme may be considered worthwhile). You don't give us any numbers, so of course I can make no guess.
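A quick check of that arithmetic:

def indexables_per_word(word_len, minrsl, maxrsl):
    # A word of length n has n - L + 1 substrings of each length L.
    return sum(word_len - L + 1
               for L in range(minrsl, min(maxrsl, word_len) + 1))

per_word = indexables_per_word(8, 2, 5)
print per_word          # 7 + 6 + 5 + 4 = 22
print 4000 * per_word   # 88000 entries in all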

As you can see, even this simple idea leads to pretty complicated code, which you're unlikely to find already available as open source, since the requirement is pretty peculiar (very few people care about "arbitrary substrings of DNS names" the way you appear to). Hence my suggestion that, unless you're confident about developing and tuning all this code in-house, you consider contacting expert professionals such as those mentioned above to get a quote for developing it for you. Of course, you'll have to give them the numbers you haven't given us, including how big the auxiliary indices are allowed to become, to allow them to make a preliminary feasibility assessment before bidding on your requirements.

Alex Martelli
Thanks for the detailed analysis. I have updated the problem statement.
demos
You don't need many model instances - just a list with all the index strings. And given that you can do prefix matches by querying for a range, there's no need to index arbitrary substrings - just every suffix of the original string.
Nick Johnson
@Nick, how do you "query for a range" given a _list_ of index strings? Considering that `>= 'oo'` will be true if **any** item in the list is "oo" or later, and `< 'op'` if **any** is less than "op", the alleged "range" obtained by `and`ing those would, it seems to me, be wildly crowded with false positives. Can you perhaps post a separate answer with pseudocode to clarify how you mean to use both lists and range queries? TIA Nick!
Alex Martelli
The correct, production behaviour for inequalities on a list is that the inequalities match on the same index entry - so you can use a range to do a prefix match on an item in a list. This is also matched by the behaviour of the new sqlite stub in the SDK; it's only the old datastore stub that has the buggy behaviour you're referring to.
Nick Johnson
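For completeness, a minimal sketch of the suffix-list approach Nick describes, assuming the production datastore behaviour he confirms (both inequality filters matching the same list item); the field and function names are illustrative:

from google.appengine.ext import db

class Search(db.Model):
    word = db.StringProperty()
    # Every suffix of word, e.g. "google" -> ["google", "oogle", "ogle",
    # "gle", "le", "e"]; any substring of word is a prefix of some suffix.
    suffixes = db.StringListProperty()

def make_search_entity(word):
    return Search(word=word,
                  suffixes=[word[i:] for i in range(len(word))])

def substring_search(term):
    # Prefix range query over the suffix list: matches any entity
    # one of whose suffixes starts with the search term.
    q = Search.all()
    q.filter('suffixes >=', term)
    q.filter('suffixes <', term + u'\ufffd')
    return q.fetch(20)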