views:

22

answers:

0

Greetings Overflowers,

I'm currently not using full-text indexing/searching capabilities of any relational database engine. This is because they do not satisfy my application complex query requirements:

  1. searching by Arabic root/template/stem/diacritics... etc ("/" stands for and/or)
  2. using logical operators between them (in 1) such as AND/OR/XOR/NOT/IF/IFF
  3. specify proximity (distance, e.g. in words, between result occurrences connected by these operators (in 2)

So, I came up with an indexing technique and a fast searching algorithm. However, so far it can only extract all the huge result set from the database, which can be inefficient when running complex queries for many users concurrently. I need a way to get only a specific page (say results no. 100-150) of the entire result set. My way is not allowing me to do it without at least going through the motion of extracting everything before that specific page without actually extracting anything, which is a waste if the user asked for the last page then jumped to the second. Plus, I might not be able to cache all the results for every query and every user.

Any ideas on resolving this issue ? And how does Google do it ? Do they cache everything or do they do on-demand extraction of specific pages from their data stores ?

Details on my current approach:

  • Indexing: I ordered root/template/stem/diacritics... etc alphabetically and record thier specific word position in their original documents.
  • Searching: I use a parallel divide and conquer strategy. For example, for the query "rootX AND:3 templateY" I do the following:

    1. Divide: create two threads: one for the rootX and another for the templateY
    2. Each thread will extract all occurrences in all documents: one for rootX and the other for templateY
    3. Conquer: since the query connects both result extraction of rootX and templateY by the logical operator AND, only rootXs and templateYs occuring in the same documents separated by 3 word distance (position of rootX - position of templateY) will be included in the final result set. Therefore, results from both threads will be merged by the main thread using the logical operator AND conditioned by the specified distance

So the question is: how to improve my strategy in order to allow for random access to specific pages in the huge and final result set without actually having to extract the entire final result set ?

Many thanks for you all !