Hello All,

I am trying to develop a complex text search engine. I have thousands of pages of text from many books. I need to find the pages that satisfy complex logical criteria. These criteria can contain virtually any combination of the following:

A: Full words.

B: Word roots (similar to stems; i.e. all words with certain key letters).

C: Word templates (in some languages, roots are slotted into certain templates to form various parts of speech, such as adjectives, past/present verbs...).

D: Logical connectives: AND/OR/XOR/NOT/IF/IFF and parentheses to state priorities.

Now, would it be faster to store the pages' full text in a database (not indexed) and search through all of it using SQL and regular expressions?

Or would it be better to build indexes of word/root/template-page-location tuples? That would speed up searching for individual words/roots/templates. However, it gets tricky once we introduce logical connectives into our queries. I thought of doing the following steps in such cases:

1: Separately search for each individual word/root/template in the specified query.

2: In order of priority, merge two result lists (from step 1) at a time, depending on the logical connective.

For example, if we are searching for "he AND (is OR was)":

1: We search for "he", "is" and "was" separately and get a result list for each word.

2: Merge the result lists of "is" and "was" using the merging function OR-MERGE.

3: Merge the merged result list from the OR-MERGE function with the one of "he" using the merging function AND-MERGE.

The result of step 3 is then returned as the result of the specified query.
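To make the three steps concrete, here is a minimal sketch in Python. It assumes each result list is a sorted list of page ids, and it handles plain full words only (no roots or templates). The names `build_index`, `and_merge`, `or_merge` and the sample `pages` are made up for illustration; they are not from any particular library.

```python
import re
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to a sorted list of ids of pages containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in re.findall(r"\w+", text.lower()):
            index[word].add(page_id)
    return {word: sorted(ids) for word, ids in index.items()}

def and_merge(a, b):
    """AND-MERGE: intersect two sorted id lists in one linear pass."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def or_merge(a, b):
    """OR-MERGE: union two sorted id lists, keeping the result sorted and unique."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])  # at most one of these two tails is non-empty
    out.extend(b[j:])
    return out

# Hypothetical sample data: page id -> page text.
pages = {
    1: "He is here",
    2: "She was there",
    3: "He was late",
    4: "He will come",
}
index = build_index(pages)

def lookup(word):
    """Step 1: fetch the result list for a single word (empty if unseen)."""
    return index.get(word, [])

# Steps 2 and 3 for the query: he AND (is OR was)
result = and_merge(lookup("he"), or_merge(lookup("is"), lookup("was")))
print(result)  # [1, 3]
```

Because both inputs are sorted, each merge costs time proportional to the combined length of the two lists, and the output is itself sorted, so it can be fed straight into the next merge.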

What do you think, gurus? Which is faster? Any better ideas?

Thank you all in advance.

+1  A: 

There are plenty of off-the-shelf solutions to this kind of problem. I would strongly recommend you use one of those instead of developing your own.

You don't say what database solution you're using. If it's Microsoft SQL Server, you could use its Full Text Search features. If it's MySQL, take a look at its Full-Text Search Functions. I'm sure Oracle, DB2 and any other major DBMS will have similar functionality.

Alternatively, take a look at Apache's Lucene for Java or Lucene for .NET. This will allow you to index documents without needing to use a DBMS.

Daniel Renshaw
Thank you for taking the time to answer my question. I found it more feasible to use BerkeleyDB from Oracle, with merging functions that have binary-search performance. My decision was due to the complexity of the support needed for the Arabic language, which is not covered by full-text databases as far as I could find.
geeko
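The comment above does not show how its binary-search merging works, so the following is only a guess at the general idea, sketched in Python with plain sorted lists standing in for whatever BerkeleyDB returns. The function name `and_merge_bisect` is made up for illustration. For an AND between a short list and a long one, each id of the short list is binary-searched in the long list instead of scanning the long list linearly.

```python
from bisect import bisect_left

def and_merge_bisect(a, b):
    """Intersect two sorted id lists by binary-searching the longer one."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    out, lo = [], 0
    for x in short:
        # Resume from the previous hit: both lists are sorted,
        # so earlier positions can never match later ids.
        lo = bisect_left(long_, x, lo)
        if lo == len(long_):
            break
        if long_[lo] == x:
            out.append(x)
    return out

# Hypothetical data: a rare word (3 pages) AND a common word (50 pages).
rare = [3, 8, 50]
common = list(range(0, 100, 2))
print(and_merge_bisect(rare, common))  # [8, 50]
```

This takes roughly len(short) * log(len(long)) comparisons, which helps when one word is much rarer than the other; for lists of similar length a single linear pass over both is usually at least as good.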