I need to develop an IFilter for Microsoft Search Server 2008 that performs prolonged computations to extract text. Extracting text from one file can take from 5 seconds to 12 hours.

One idea is to create a preprocessing application.

How do I design such an application? Specifically:

- How do I connect the Search Server crawler to my application?
- How do I feed extracted text into Search Server once extraction is complete?

+2  A: 

First you will need to code the IFilter itself.

This article is quite good and references some good articles too: IFilter.org. Also see this set of articles.

Next is the issue of how to pre-process. The easiest way I can think of is to create a FileSystemWatcher to kick off the pre-processing of the document.
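FileSystemWatcher itself is the .NET component; since the filter side is native COM, here is a rough Win32 sketch of the same watcher idea using change notifications. The watched folder and RunPreProcessor are illustrative placeholders, not anything Search Server mandates:

    // Minimal sketch of a directory watcher that kicks off pre-processing,
    // a Win32 analogue of .NET's FileSystemWatcher.
    #include <windows.h>
    #include <iostream>

    void RunPreProcessor()
    {
        // Hypothetical: scan the folder for documents that have no
        // [name]_index.txt companion yet and extract their text.
    }

    int main()
    {
        HANDLE hChange = FindFirstChangeNotificationW(
            L"C:\\Documents",              // assumed document root
            TRUE,                          // watch subdirectories
            FILE_NOTIFY_CHANGE_FILE_NAME | FILE_NOTIFY_CHANGE_LAST_WRITE);
        if (hChange == INVALID_HANDLE_VALUE)
        {
            std::cerr << "FindFirstChangeNotification failed: "
                      << GetLastError() << "\n";
            return 1;
        }

        for (;;)
        {
            if (WaitForSingleObject(hChange, INFINITE) != WAIT_OBJECT_0)
                break;
            RunPreProcessor();             // something changed: (re)process
            if (!FindNextChangeNotification(hChange))
                break;
        }
        FindCloseChangeNotification(hChange);
        return 0;
    }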

The pre-processor can parse the text from the document and store it somewhere.

That "somewhere" becomes the next issue and that is primarily a business kind of decision. If the directory for the documents is okay to add to, I would add an Index directory in each folder as documents are parsed and store a file such as [OriginalFilenameSansExtemsion]_index.txt inside.

If that is not possible, create an Index folder on each drive and mirror the directory structure as needed. At the end of the day, all you need is for the IFilter to be able to determine, based on the filename of the file being indexed, where to look for the text document with its pre-processed content.
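A small helper capturing the first convention: derive the index-file path from the original document's path. GetIndexPath is an illustrative name, not an API the answer prescribes; sharing one helper like this between the pre-processor and the IFilter is one way to keep their "finding" logic from drifting apart:

    #include <filesystem>

    std::filesystem::path GetIndexPath(const std::filesystem::path& document)
    {
        std::filesystem::path indexFile = document.stem();  // filename sans extension
        indexFile += L"_index.txt";
        // e.g. C:\Docs\report.xyz --> C:\Docs\Index\report_index.txt
        return document.parent_path() / L"Index" / indexFile;
    }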

When the IFilter runs, Init is called. When that happens, simply load the text document and return its contents as the GetChunk, GetText and GetValue functions are called.
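A compressed sketch of that flow, assuming the pre-processor has already written the text file. CTextRedirectFilter and its members are illustrative names; the class factory, IPersistFile plumbing, registration and text-encoding handling are all omitted:

    #include <windows.h>
    #include <filter.h>    // IFilter, STAT_CHUNK, CHUNK_TEXT, ...
    #include <filterr.h>   // FILTER_E_* result codes
    #include <algorithm>
    #include <fstream>
    #include <string>

    // PSGUID_STORAGE / PID_STG_CONTENTS: the "document contents" property.
    static const GUID kGuidStorage =
        { 0xb725f130, 0x47ef, 0x101a,
          { 0xa5, 0xf1, 0x02, 0x60, 0x8c, 0x9e, 0xeb, 0xac } };
    static const PROPID kPidContents = 0x13;   // PID_STG_CONTENTS

    class CTextRedirectFilter : public IFilter
    {
        ULONG m_ref = 1;
        std::wstring m_indexPath;   // set from the document path beforehand
        std::wstring m_text;        // pre-extracted text, loaded in Init
        size_t m_pos = 0;
        bool m_chunkReturned = false;

    public:
        explicit CTextRedirectFilter(std::wstring indexPath)
            : m_indexPath(std::move(indexPath)) {}

        SCODE STDMETHODCALLTYPE Init(ULONG, ULONG, const FULLPROPSPEC*,
                                     ULONG* pFlags) override
        {
            *pFlags = 0;
            std::wifstream in(m_indexPath.c_str());
            if (!in)
                return FILTER_E_ACCESS;   // no pre-processed text available
            m_text.assign(std::istreambuf_iterator<wchar_t>(in),
                          std::istreambuf_iterator<wchar_t>());
            m_pos = 0;
            m_chunkReturned = false;
            return S_OK;
        }

        SCODE STDMETHODCALLTYPE GetChunk(STAT_CHUNK* pStat) override
        {
            if (m_chunkReturned)
                return FILTER_E_END_OF_CHUNKS;   // one text chunk suffices here
            pStat->idChunk = 1;
            pStat->breakType = CHUNK_EOS;
            pStat->flags = CHUNK_TEXT;
            pStat->locale = 0;
            pStat->attribute.guidPropSet = kGuidStorage;
            pStat->attribute.psProperty.ulKind = PRSPEC_PROPID;
            pStat->attribute.psProperty.propid = kPidContents;
            pStat->idChunkSource = 1;
            pStat->cwcStartSource = 0;
            pStat->cwcLenSource = 0;
            m_chunkReturned = true;
            return S_OK;
        }

        SCODE STDMETHODCALLTYPE GetText(ULONG* pcwcBuffer, WCHAR* awcBuffer) override
        {
            if (m_pos >= m_text.size())
                return FILTER_E_NO_MORE_TEXT;
            ULONG n = static_cast<ULONG>(
                std::min<size_t>(*pcwcBuffer, m_text.size() - m_pos));
            m_text.copy(awcBuffer, n, m_pos);
            m_pos += n;
            *pcwcBuffer = n;
            return S_OK;
        }

        SCODE STDMETHODCALLTYPE GetValue(PROPVARIANT**) override
        { return FILTER_E_NO_VALUES; }   // this sketch emits text only

        SCODE STDMETHODCALLTYPE BindRegion(FILTERREGION, REFIID, void**) override
        { return E_NOTIMPL; }

        // Minimal IUnknown so the sketch is complete (link with uuid.lib).
        HRESULT STDMETHODCALLTYPE QueryInterface(REFIID riid, void** ppv) override
        {
            if (riid == IID_IUnknown || riid == IID_IFilter)
            { *ppv = static_cast<IFilter*>(this); AddRef(); return S_OK; }
            *ppv = nullptr;
            return E_NOINTERFACE;
        }
        ULONG STDMETHODCALLTYPE AddRef() override { return ++m_ref; }
        ULONG STDMETHODCALLTYPE Release() override
        { ULONG r = --m_ref; if (r == 0) delete this; return r; }
    };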

This solution will end up with an implicit dependency between the pre-processor and the IFilter, as they will both store their own way of "finding" the index document.

It should be possible to store the location of index documents in some shared configuration location.

Update How will the IFilter method be called under Search Server? Once created, the IFilter will have to be installed on the indexing server (i.e. the relevant DLL will have to be registered). Using this article as a guide: as part of your implementation, you will have given your filter a unique GUID for its CLSID. The registration process will then be similar to the following, just using a different extension and GUID.

STEP 1: COM REGISTRATION

1. Add registry key: HKEY_CLASSES_ROOT\CLSID\<IFilter CLSID>\InprocServer32, with (Default) pointing to the filter DLL and ThreadingModel : Both

STEP 2: REGISTER IFILTER WITH OS

There are 4 steps to registering the filter-extension mapping with the OS (angle-bracketed values are placeholders for your own extension, document class and GUIDs):

  1. HKEY_CLASSES_ROOT\<.ext>\(Default) --> <Document Class>
  2. HKEY_CLASSES_ROOT\<Document Class>\(Default) --> <Document Class description>
  3. HKEY_CLASSES_ROOT\<Document Class>\PersistentHandler\(Default) --> <PersistentHandler GUID>
  4. HKEY_CLASSES_ROOT\CLSID\<PersistentHandler GUID>\PersistentAddinsRegistered\IID_IFilter\(Default) --> <IFilter CLSID>

(IID_IFilter is {89BCB740-6119-101A-BCB7-00DD010655AF}.)

Now we're all set to register our product with WSS (Windows SharePoint Services) or MOSS (Microsoft Office SharePoint Server).

STEP 3: REGISTER FILTER EXTENSION WITH MOSS

  1. Add the filter-extension to the file types crawled: Start -> Programs -> Microsoft Office Server -> SharePoint 3.0 Central Administration -> <Shared Services Provider> -> Search Settings -> File Types -> New File Type (add the extension here)

  2. Add the following registry keys:

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Applications\<Application GUID>\Gather\Portal_Content\Extensions\ExtensionList]
    (add the new extension as the next numbered string value in this list)

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\Filters\<.ext>]
    Default = (value not set)
    Extension = <ext>
    FileTypeBucket REG_DWORD = 0x00000001 (1)
    MimeTypes = <MIME type of the documents>

    [HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\<.ext>]
    Default REG_MULTI_SZ = <IFilter CLSID>

  3. Finally, recycle the Search Service by executing the following commands from the command window:

    D:> net stop osearch

    D:> net start osearch

Does the Search Server pass a URL and not the local file name? The LoadIFilter function is where you will have the pathname of the file. It is here that you create the instance of the IFilter that reads the indexed text instead of the actual file.

What will I do if it calls IFilter::Init for a URL which is not indexed yet? If the indexed file does not exist, you will not be able to index it, so return one of the available error codes.

A pre-processing application will need to extract the text from a document when that takes a long time. The text will need to be stored where the IFilter can access it when it comes to process the file during the LoadIFilter function (which is passed the URL/filepath of the file by the search application). Using the URL/filepath of the file, the IFilter must be able to determine where the previously extracted text is. The IFilter can then load the text and parse it instead of the "actual" file, bypassing the need for long crawl times.
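To make that concrete, here is a sketch of the resolution step as it might run where the filter first receives the document's path (IPersistFile::Load in a typical IFilter). GetIndexPath is the helper sketched earlier; AppendToPendingList and the pending-list location are assumptions that anticipate the multi-pass crawl described below, not part of the original answer:

    #include <windows.h>
    #include <filesystem>
    #include <fstream>
    #include <string>

    namespace fs = std::filesystem;

    fs::path GetIndexPath(const fs::path& document);   // sketched earlier

    void AppendToPendingList(const fs::path& doc)
    {
        // Hypothetical shared work list the pre-processor polls.
        std::wofstream out(L"C:\\ProgramData\\MyFilter\\pending.txt",
                           std::ios::app);
        out << doc.wstring() << L"\n";
    }

    // Returns S_OK and the index path when pre-extracted text exists;
    // otherwise queues the document and fails this crawl pass.
    HRESULT ResolveIndexedText(const wchar_t* docPath, std::wstring& indexPath)
    {
        fs::path doc(docPath);
        fs::path index = GetIndexPath(doc);
        if (!fs::exists(index))
        {
            AppendToPendingList(doc);   // picked up by the pre-processor later
            return E_FAIL;              // skipped until the next crawl
        }
        indexPath = index.wstring();
        return S_OK;
    }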

If you aren't going to get the pre-processor to do entire sites, it will take multiple passes of the search crawler to get what you require. Assume the crawler is doing an incremental crawl every evening. The first day a file is added, the incremental crawl picks up the file and passes it to LoadIFilter. The function looks and cannot see any pre-processed text for the file, so it adds the path to a config file (or list) and returns an error code; the file does not get added to the search results. The pre-processor, at a different time, looks at the config list, sees that there is a file to be processed, and starts the work. When it finishes, it stores the text and removes the file from the config list. The next time the crawler runs, it will find the file and its stored text for parsing.
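A sketch of the pre-processor's half of that hand-off: drain the pending list written by the filter, run the slow extraction, store the text, then clear the list. ExtractText stands in for the 5-second-to-12-hour computation, and locking between the two processes is deliberately ignored here:

    #include <filesystem>
    #include <fstream>
    #include <string>

    namespace fs = std::filesystem;

    fs::path GetIndexPath(const fs::path& document);     // sketched earlier
    std::wstring ExtractText(const fs::path& document);  // the expensive step

    void DrainPendingList(const fs::path& pendingList)
    {
        std::wifstream in(pendingList);
        std::wstring line;
        while (std::getline(in, line))
        {
            if (line.empty())
                continue;
            fs::path doc(line);
            std::wofstream out(GetIndexPath(doc));
            out << ExtractText(doc);   // may run for hours, one doc at a time
        }
        in.close();
        // Everything handled: truncate the list so the next crawl starts clean.
        std::wofstream(pendingList, std::ios::trunc);
    }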

This process is starting to get a bit complex, and I would worry about the crawler and the pre-processor having to communicate so well. Also, the incremental crawl may need the pre-processor to "touch" the file once it has had its text extracted.

At this point, it may be best to develop something and see what happens, as so far this is just a theoretical algorithm.

Hope this is helpful.

Nat
That's clear, but how will the IFilter method be called under Search Server? I guess the Search Server passes a URL and not the local file name. Also, what will I do if it calls IFilter::Init for a URL which is not indexed yet?
sharptooth
I just don't get what file name is passed to IFilter::Init(). Search Server indexes a site, so it walks through a hierarchy of URLs. How do these URLs translate into file names passed into IFilter::Init()?
sharptooth
I also don't get how this approach deals with long processing. The Search Server wishes to index some URL and loads the IFilter. Text extraction takes very long, so the IFilter asks the helper application to perform extraction. But what should the filter itself do at this moment?
sharptooth
The SharePoint search crawler handles passing the files to the IFilter, so you will not have to care. You can search over a file system as well.
Nat
This approach totally separates the long process from the search filter. The processing of the text is a completely separate process.
Nat
Do you mean that the crawler retrieves the files and copies them to the machine where the IFilter is installed and passes local file names? Are the file names unique then?
sharptooth
So the crawler passes a file to the IFilter, but the IFilter will require a long time to run. How should it indicate that it needs a long time and hasn't hung up and doesn't refuse to work on this very file?
sharptooth
The problem is that the preprocessing application doesn't know what files to extract text from until the Search Server has called the IFilter. What will it preprocess then? Only the crawler knows what is on the site, but when it tries to load the IFilter, it's too late for heavy extraction.
sharptooth
The pre-processor can easily look through the site, either through the object model or web services, to find all documents that need to be extracted.
Nat
A preprocessor looking through the site would be a duplicate crawler, and it would have to either sync settings with the crawler or be configured separately. That's not very good. I guess I'll try to implement the approach where the preprocessor is called through the IFilter only and report how it works.
sharptooth