views:

168

answers:

1

Hi all,

Do any of you have any experience with using Oracle Text to search for content inside PDF files?

I have a table, with a field called FILEDATA(blob).

I would like to do the following query:

SELECT id FROM ttc.contract_attachment WHERE CONTAINS(filedata, 'EXAMPLE') > 0;

However, i'm not too sure about the type of index to add to it.

I found the following code:

begin 
  ctx_ddl.create_preference('doc_lexer', 'BASIC_LEXER'); 
  ctx_ddl.set_attribute('doc_lexer', 'printjoins', '_-'); 
end; 
/ 

create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT) indextype is ctxsys.context
  parameters ('lexer doc_lexer sync (on commit)');

Ref: http://www.devx.com/dbzone/Article/21563/1954

I have no idea what BASIC_LEXER is. I'm at a bit of a loss. I shall endeavour to continue searching for an answer. Any help would be great.

Thanks.

A: 

I've used Oracle Text to index not only PDF's but other data like XML structures. Oracle has the concept of lexers which take content and parses, tokenizes and indexes the tokens. The basic lexer handles English words, there are other lexers for Chinese, Japanese, Korean, etc. The printjoin attribute allows you to index characters that are normally excluded such as hyphes, quotes, etc.

The index you have defined above will work. Keep in mind that Oracle Text indexing is an asynchronous process, meaning the commit occurs and then sometime in the future the document is indexed. However you will need to synchronize the index as part of a scheduled job or the like. With the option "sync (on commit)" on your index, it will index the document as part of the transaction. This is noteworthy only if you are indexing sizable PDF documents.

I would recommend utilizing progressive relaxation for any search you may want to run, as it can being with a restrictive search and expand out to a more generic search, thereby providing the user with results that are decreasing in relevancy. For instance:

    <query>
   <textquery lang="ENGLISH" grammar="CONTEXT"> cat dog
     <progression>
       <seq><rewrite>transform((TOKENS, "{", "}", " "))</rewrite></seq>
       <seq><rewrite>transform((TOKENS, "{", "}", "AND"))</rewrite></seq>
       <seq><rewrite>transform((TOKENS, "{", "}", "ACCUM"))</rewrite></seq>
     </progression>
   </textquery>
  <score datatype="INTEGER" algorithm="COUNT"/>
</query>

The above query tokenizes the search keywords "cat dog" attempts to find them as a phrase, then any documents contains cat AND dog (not necessarily beside each other), then any document containing cat OR dog, documents containing both words are scored higher than if a document just has a single one. Futhermore the structure automatically dedups the results as it returns them.

All of that being said, you could simply define your index as:

create index idxContentMgmtBinary on CMDEMO.CONTENT_INVENTORY(TEXT) indextype is ctxsys.context
  parameters ('sync (on commit)');

and it would probably work very well for your needs. You would only need to change the behavior of the lexer if you have a need for doing so. I hope this helps.

Nick