First you will need to code the IFilter itself.
This article is quite good and it references some good articles too.
IFilter.org
Also see this set of articles
Next is the issue of how to pre-process.
The easiest way I can think of is to create a FileSystemWatcher to kick off the pre-processing of the document.
The pre-processor can parse the text from the document and store it somewhere.
That "somewhere" becomes the next issue and that is primarily a business kind of decision.
If the directory for the documents is okay to add to, I would add an Index directory in each folder as documents are parsed and store a file such as [OriginalFilenameSansExtension]_index.txt inside.
If that is not possible, create an Index folder on each drive and mirror the directory structure as needed.
At the end of the day, all you need is for the IFilter to be able to determine, based on the filename of the file being indexed, where to look for the text document with its pre-processed content.
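As a rough sketch (the watched folder is hypothetical, and ExtractText stands in for whatever slow, format-specific parsing your documents actually need), the pre-processor could look something like this in C#:

    using System;
    using System.IO;

    class PreProcessor
    {
        static void Main()
        {
            var watcher = new FileSystemWatcher(@"C:\Documents")
            {
                IncludeSubdirectories = true
            };
            watcher.Created += (s, e) => ProcessDocument(e.FullPath);
            watcher.Changed += (s, e) => ProcessDocument(e.FullPath);
            watcher.EnableRaisingEvents = true;
            Console.ReadLine(); // keep the process alive while watching
        }

        // The IFilter must use the exact same mapping to find the extracted text.
        static string MapToIndexPath(string documentPath)
        {
            string dir = Path.GetDirectoryName(documentPath);
            string name = Path.GetFileNameWithoutExtension(documentPath);
            return Path.Combine(dir, "Index", name + "_index.txt");
        }

        static void ProcessDocument(string path)
        {
            if (path.Contains(@"\Index\")) return; // don't re-process our own output
            string indexPath = MapToIndexPath(path);
            Directory.CreateDirectory(Path.GetDirectoryName(indexPath));
            File.WriteAllText(indexPath, ExtractText(path));
        }

        // Placeholder for the slow, document-specific text extraction.
        static string ExtractText(string path)
        {
            return "...extracted text for " + path + "...";
        }
    }

A real version would also need to debounce Changed events (they can fire while the file is still being written), but the shape is the same.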
When the IFilter runs, Init is called. When that happens, simply load the text document and return its contents as the GetChunk, GetText and GetValue functions are called.
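Conceptually, the filter side then looks like this (a deliberately simplified, non-COM sketch; a real IFilter implements the COM interface with exact HRESULT and struct signatures, and the paragraph-based chunking here is just a stand-in):

    using System;
    using System.IO;

    class PreIndexedFilterCore
    {
        private string[] chunks;
        private int current = -1;

        // Called when the crawler initialises the filter: load the pre-extracted
        // text instead of parsing the original document.
        public void Init(string indexPath)
        {
            chunks = File.ReadAllText(indexPath)
                         .Split(new[] { "\r\n\r\n" }, StringSplitOptions.RemoveEmptyEntries);
        }

        // Mirrors IFilter::GetChunk: advance to the next chunk, false when done.
        public bool GetChunk()
        {
            current++;
            return current < chunks.Length;
        }

        // Mirrors IFilter::GetText: hand back the text of the current chunk.
        public string GetText()
        {
            return chunks[current];
        }
    }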
This solution will end up with an implicit dependency between the pre-processor and the IFilter, as they will both encode their own way of "finding" the index document.
It should be possible to store the location of index documents in some shared configuration location.
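For example, both executables could read the root of the index store from one agreed file, so the mapping rule lives in exactly one place (the config path here is hypothetical):

    using System.IO;

    static class IndexConfig
    {
        // Shared by the pre-processor and the IFilter.
        public static string GetIndexRoot()
        {
            return File.ReadAllText(@"C:\SearchConfig\indexroot.txt").Trim();
        }
    }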
Update
How will the IFilter method be called under Search Server?
Once created, the IFilter will have to be installed on the indexing server (i.e. the relevant DLL will have to be registered).
Using this article as a guide, as part of your implementation you will have given your filter a unique GUID for its CLSID.
The registration process will then be similar to that article's, just using your own extension and GUID.
STEP 1: COM REGISTRATION

1. Add registry key: HKEY_CLASSES_ROOT\CLSID\<filter CLSID>\InprocServer32
   (Default) --> <path to filter DLL>
   ThreadingModel : Both

STEP 2: REGISTER IFILTER WITH OS

There are 4 steps to registering the filter-extension mapping with the OS:

- HKEY_CLASSES_ROOT\<.ext>\(Default) --> <document ProgID>
- HKEY_CLASSES_ROOT\<document ProgID>\(Default) --> <document type description>
- HKEY_CLASSES_ROOT\<.ext>\PersistentHandler\(Default) --> <persistent handler GUID>
- HKEY_CLASSES_ROOT\CLSID\<persistent handler GUID>\PersistentAddinsRegistered\<IID_IFilter>\(Default) --> <filter CLSID>

(<IID_IFilter> is the fixed GUID {89BCB740-6119-101A-BCB7-00DD010655AF}.)
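If you prefer to script the registration, here is a hedged C# sketch using Microsoft.Win32.Registry. All GUIDs, the ".myext" extension and the DLL path are placeholders you must replace with your own values, and it has to run elevated:

    using Microsoft.Win32;

    class RegisterFilter
    {
        // Placeholder GUIDs: replace with the ones you generated for your filter.
        const string FilterClsid = "{00000000-0000-0000-0000-000000000001}";
        const string HandlerGuid = "{00000000-0000-0000-0000-000000000002}";
        // IID_IFilter is fixed and documented.
        const string IidIFilter  = "{89BCB740-6119-101A-BCB7-00DD010655AF}";

        static void Main()
        {
            // STEP 1: COM registration of the filter DLL.
            using (var key = Registry.ClassesRoot.CreateSubKey(
                @"CLSID\" + FilterClsid + @"\InprocServer32"))
            {
                key.SetValue("", @"C:\Filters\MyFilter.dll"); // path to your DLL
                key.SetValue("ThreadingModel", "Both");
            }

            // STEP 2: map the extension to a persistent handler...
            using (var key = Registry.ClassesRoot.CreateSubKey(@".myext\PersistentHandler"))
            {
                key.SetValue("", HandlerGuid);
            }

            // ...and the persistent handler to the filter CLSID.
            using (var key = Registry.ClassesRoot.CreateSubKey(
                @"CLSID\" + HandlerGuid + @"\PersistentAddinsRegistered\" + IidIFilter))
            {
                key.SetValue("", FilterClsid);
            }
        }
    }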
Now we're all set to register our product with WSS (Windows SharePoint Services) or MOSS (Microsoft Office SharePoint Server).
STEP 3: REGISTER FILTER EXTENSION WITH MOSS

Add the filter extension to the file types crawled: Start -> Programs -> Microsoft Office Server -> SharePoint 3.0 Central Administration -> <your SSP> -> Search Settings -> File Types -> New File Type (add the extension here).
Add the following registry keys:

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Applications\<application GUID>\Gather\Portal_Content\Extensions\ExtensionList]
    Add the extension as the next numbered string value

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\Filters\<.ext>]
    Default = (value not set)
    Extension = <ext>
    FileTypeBucket REG_DWORD = 0x00000001 (1)
    MimeTypes = <mime type>

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Office Server\12.0\Search\Setup\ContentIndexCommon\Filters\Extension\<.ext>]
    Default REG_MULTI_SZ = <IFilter CLSID>
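The REG_MULTI_SZ default on that last key can also be set from code; a small sketch (extension and GUID again placeholders):

    using Microsoft.Win32;

    class RegisterWithMoss
    {
        static void Main()
        {
            // REG_MULTI_SZ default value holding the filter CLSID.
            using (var key = Registry.LocalMachine.CreateSubKey(
                @"SOFTWARE\Microsoft\Office Server\12.0\Search\Setup" +
                @"\ContentIndexCommon\Filters\Extension\.myext"))
            {
                key.SetValue("", new[] { "{00000000-0000-0000-0000-000000000001}" },
                             RegistryValueKind.MultiString);
            }
        }
    }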
Finally, recycle the Search service by executing the following commands from a command window:

    net stop osearch
    net start osearch
Does the Search Server pass a URL and not the local file name?
The LoadIFilter function is where you will have the pathname for the file. It is here that you create the instance of the IFilter that reads the indexed text instead of the actual file.
What will I do if it calls IFilter::Init for a URL which is not indexed yet?
If the pre-processed file does not exist, you will not be able to index the document, so return one of the available IFilter error codes (such as FILTER_E_ACCESS).
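As a sketch of that fallback (managed IFilter samples typically receive the document path through IPersistFile.Load before Init runs; the mapping helper and paths are the same hypothetical ones used above):

    using System.IO;

    class PreIndexedFilterLoad
    {
        private const int S_OK   = 0;
        private const int E_FAIL = unchecked((int)0x80004005); // a real filter would
                                   // more likely return FILTER_E_ACCESS from filterr.h

        private string indexPath;
        private string text;

        // Same hypothetical mapping rule the pre-processor uses.
        private static string MapToIndexPath(string documentPath)
        {
            return Path.Combine(Path.GetDirectoryName(documentPath), "Index",
                                Path.GetFileNameWithoutExtension(documentPath) + "_index.txt");
        }

        // Managed IFilter samples usually get the path via IPersistFile.Load,
        // which runs before Init.
        public void Load(string documentPath)
        {
            indexPath = MapToIndexPath(documentPath);
        }

        // Mirrors IFilter::Init: fail if the pre-processor has not run yet,
        // so the file is simply skipped on this crawl.
        public int Init()
        {
            if (!File.Exists(indexPath))
                return E_FAIL;
            text = File.ReadAllText(indexPath);
            return S_OK;
        }
    }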
To summarise: if extracting the text from a document takes a long time, a pre-processing application will need to do that extraction ahead of time. The text must be stored somewhere the IFilter can reach when the file comes up for processing (the search application passes the file's URL/path when it loads the filter via LoadIFilter). From that URL/path alone, the IFilter must be able to determine where the previously extracted text is.
The IFilter can then load and parse that text instead of the "actual" file, bypassing the need for long search crawl times.
If you aren't going to have the pre-processor work through entire sites up front, it will take multiple passes of the search crawler to get everything you require.
Assume the crawler is doing an incremental crawl every evening.
The first day a file is added, the incremental crawl picks it up and passes it to LoadIFilter. The function looks for pre-processed text for the file, finds none, so it adds the path to a config file (or list) and returns an error code.
The file does not get added to the search results.
The pre-processor, at a different time, looks at the config list, sees that there is a file to be processed and starts the work. When it finishes, it stores the text and removes the file from the config list.
The next time the crawler runs, it will find the file and its stored text for parsing.
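A minimal sketch of that handshake, assuming a shared pending-list file (the path is hypothetical, and real code would need cross-process locking):

    using System.IO;

    static class PendingQueue
    {
        const string PendingList = @"C:\SearchConfig\pending.txt";

        // Filter side: remember a file whose text is not extracted yet.
        public static void Enqueue(string documentPath)
        {
            File.AppendAllLines(PendingList, new[] { documentPath });
        }

        // Pre-processor side: take everything queued so far and clear the list.
        public static string[] DrainAll()
        {
            if (!File.Exists(PendingList)) return new string[0];
            string[] items = File.ReadAllLines(PendingList);
            File.Delete(PendingList);
            return items;
        }
    }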
This process is starting to get a bit complex, and I would worry about the crawler and the pre-processor having to communicate so tightly. Also, the incremental crawl may need the pre-processor to "touch" the file once its text has been extracted, so that the next crawl sees it as modified and picks it up again.
At this point, it may be best to develop something and see what happens, as so far this is just a theoretical algorithm.
Hope this is helpful.