I want to use the TIFF IFilter built into Windows Server 2008 R2 with Full-Text search in SQL Server 2008... also R2.

I have installed the filter through server manager and updated the "Force TIFF IFilter to perform OCR for every page in a TIFF document" Local Group Policy setting in Computer Configuration -> Administrative Templates -> OCR to "Enabled."
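For anyone reproducing this setup, a minimal sketch of how to check that SQL Server itself can see the OS-level filter. It assumes the instance needs to be told to load OS-registered filters (the usual step for IFilters that do not ship with SQL Server) and that the full-text service may need a restart afterwards; neither step is spelled out in this thread.

```sql
-- Allow the full-text engine to load filters and word breakers registered with the OS.
-- Assumption: a restart of the SQL Server / full-text daemon may be needed afterwards.
EXEC sp_fulltext_service @action = 'load_os_resources', @value = 1;

-- List the filters the instance knows about for TIFF documents.
-- No rows back would suggest the TIFF IFilter is not visible to SQL Server.
SELECT document_type, path, manufacturer, version
FROM sys.fulltext_document_types
WHERE document_type IN (N'.tif', N'.tiff');
```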

I have also created a full-text catalog and a table called "FileData" that looks like this:

CREATE TABLE [FileServer].[FileData](
 [FileDataId] [int] IDENTITY(1,1) NOT NULL,
 [FileGUID] [uniqueidentifier] ROWGUIDCOL  NOT NULL,
 [Data] [varbinary](max) FILESTREAM  NOT NULL,
 [Extension] [nvarchar](100) NULL,
 [Filename] [nvarchar](256) NULL,
 [Path] [nvarchar](256) NULL,
 CONSTRAINT [PK_FileData_FileDataId] PRIMARY KEY CLUSTERED 
(
 [FileDataId] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY] FILESTREAM_ON [FILES],
 CONSTRAINT [UX_File_FileGUID] UNIQUE NONCLUSTERED 
(
 [FileGUID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY] FILESTREAM_ON [FILES]

GO

SET ANSI_PADDING OFF
GO

ALTER TABLE [FileServer].[FileData] ADD  CONSTRAINT [DF_FileData_FileGUID]  DEFAULT (newid()) FOR [FileGUID]
GO

ALTER TABLE [FileServer].[FileData] ADD  CONSTRAINT [DF_FileData_FileData]  DEFAULT (0x) FOR [Data]
GO
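
The full-text catalog and index definitions are not shown above. A minimal sketch of what they might look like, assuming a catalog named FileCatalog (a hypothetical name) and using the Extension column as the type column so the engine can choose the IFilter per row. The real index evidently covers more columns, since the query further down also searches Extension.

```sql
-- Hypothetical catalog name; the original post does not give one.
CREATE FULLTEXT CATALOG [FileCatalog];
GO

-- Index the FILESTREAM column; [Extension] tells the engine which filter
-- to use for each row. The key index is the table's existing primary key.
CREATE FULLTEXT INDEX ON [FileServer].[FileData]
(
    [Data] TYPE COLUMN [Extension]
)
KEY INDEX [PK_FileData_FileDataId]
ON [FileCatalog]
WITH CHANGE_TRACKING AUTO;
GO
```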

When I insert a file into that table, like a PDF or Word DOC, I can hit keywords in the file moments later with a full-text search.

I made a big huge TIFF file with very clear text (1024 x 768... about 12 words) and imported THAT into the FileData table. I can find every word in it:

SELECT [Path], [Filename], [Data]
FROM [FileServer].[FileData]
WHERE FREETEXT(*, 'Jason') and FREETEXT(Extension, 'tif');

However, when I use a "real" TIFF file, like a datasheet from a manufacturer, I get ZERO results when searching for keywords. I do not have a clue as to why, and there is not much online about troubleshooting this with SQL Server.

I have tried saving the .TIFF file with various kinds of compression, without compression, etc... and I am just not having any luck. The text in my test file is CRYSTAL clear and still pretty large. I cannot imagine that the file clarity is the problem, although I suppose that is possible.

Just so you would have something to compare, I took the following two images and imported them:

WORKING SAMPLE FILE BROKEN SAMPLE FILE

The results for the working sample are REALLY good. These are the keywords from the working sample in the full-text index: $3.50 © 0004 08 1989 2010 21 21:35:42 235 282 3116 3702 40 48109 89 abounds absorb abstract accompanied acquired act action advantages agency algorithm algorithms already amounts amsterdam analyze ann appeared applications arbor arnficioj artficia1 assignment b.v. based basis booker brigade bucket building bv capabilities carefully changing characteristics checkers classifier classtfier closing cognitive comparing competing complex complexities complexity computer confronting confuse consider continual continually continuously contrived credit cures d.e. data de decent defined definition design designed devising discovery discussion disturbing during ecological economic eecs effort elsevier END OF FILE engineering environment environments err even events example exhibit experience expressed extant extensions face faces feasible file firing first flow following format game generates generic genetic giving goals goldberg good holiadd holland however hypotheses image immersed immune impinging implicitly inexactly information intelligence interest intervene introduction irrelevant j.h. jh journal l.b. large lb learn learning lifespan long machine mammal mammalian mammal's massively message mi michigan new nn0004 nn08 nn1989 nn2010 nn21 nn235 nn282 nn3116 nn3702 nn3d5$ nn40 nn48109 nn89 noisy north nos novel novelty obtainable often one operate option originally outside own paper parallel passing pattern payoff permission perpetual perpetually play player plays possible pretty problems provide publisher publishers quickly randomly rarely real realistic reinforcement repeatedly reprinted requirements retina reviews revise robotic rule rules science sequences sets significantly simple simply small sparse system systems tagged techniques theory thor tiff time tt2135 twice twists two typically u.s.a. university upon us usa visual vol without wonder world

But the results from the Broken Sample are just... well, vacant. Not a single word from the actual TIFF image: 08 2010 21 21:49:22 END OF FILE file format image nn08 nn2010 nn21 tagged tiff tt2149
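
Keyword lists like the two above can be pulled from the full-text index with the dynamic management functions that ship with SQL Server 2008. This is a sketch, not necessarily the exact query used here; the document_id value is a hypothetical FileDataId.

```sql
-- Every term in the full-text index for the table, with the number of rows containing it.
SELECT display_term, document_count
FROM sys.dm_fts_index_keywords(DB_ID(), OBJECT_ID(N'FileServer.FileData'))
ORDER BY display_term;

-- Terms for a single row. With an integer full-text key, document_id
-- corresponds to the key value (here FileDataId).
SELECT document_id, display_term, occurrence_count
FROM sys.dm_fts_index_keywords_by_document(DB_ID(), OBJECT_ID(N'FileServer.FileData'))
WHERE document_id = 2   -- hypothetical FileDataId of the broken sample
ORDER BY display_term;
```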

If anybody has any ideas on what to try next, I'm ALL ears.

+1  A: 

Try converting the non-working image to black and white, and see if more words get recognized.

Added

Try using IrfanView (or any image tool) to set the DPI of the second image to 300. Then try again.

Obviously, these troubleshooting steps aren't permanent solutions; they just help isolate the problem.

rwong
+1  A: 

rwong is correct. You need to isolate the problem.

Not all OCR engines can process color TIFF images; many prefer B/W. I am guessing that the OCR engine is not even processing your non-working page and just issues an error message you cannot see.

  1. As per above, try saving the file as a B/W TIFF image.
  2. Save the file as a JPEG and try recognising the image as a JPEG.

I ran your non-working image through my OCR and was able to extract most of the text correctly, so resolution is not a major issue.

Andrew Cash
A: 

Well, it turns out the actual problem was the SIZE of the image. The OCR in the TIFF IFilter just wasn't even attempting to process it... too big. I had to discover this by trial and error, and could not find any documentation stating the maximum size/DPI of the incoming TIFF. Anybody know these specs? This article appears to have some information: support.microsoft.com/kb/837847 But it is specific to SharePoint, and I have not had time to mess with the settings to see if it works. Also, I'd really need to just remove the size cap. Ideas there?

Eric
OK... My server is Server 2008 R2, and the registry keys in the aforementioned article do not even EXIST. However, I did find this value: HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\TiffIFilter\MaxImageSize. I'm really pretty disappointed in the documentation on this particular IFilter... there just doesn't seem to be much on its actual behaviour. It's probably fine for SharePoint, but as a developer/SQL Admin, I need a bit more. Maybe MS will read this and update it for us.
Eric
OK, another facet to the problem. The registry value I have is 38797312. If that is a byte count, it translates to about 37 MB (38797312 = 37 x 1024 x 1024), give or take. The image I had posted was pretty large, but not THAT large.
Eric
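One way to compare the stored size of each TIFF against that registry value is sketched below. It assumes MaxImageSize is a byte count, which nothing in this thread confirms.

```sql
-- Stored size of each TIFF row next to the registry value found above.
-- Assumption: MaxImageSize (38797312) is measured in bytes.
SELECT [Filename],
       DATALENGTH([Data]) AS SizeBytes,
       CASE WHEN DATALENGTH([Data]) > 38797312 THEN 1 ELSE 0 END AS ExceedsRegistryValue
FROM [FileServer].[FileData]
WHERE [Extension] LIKE N'%tif%'
ORDER BY SizeBytes DESC;
```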
The source file is 25.823 inches x 34.458 inches at 96 DPI. Changing the DPI to 300 brings the size down to 9.163 x 12.228, but does not change the file size. This isolates the problem to the width/height of the document as dimensions, rather than the file size or the WxH in pixels. I suppose I'll just convert all .TIFF files to 300 DPI on the fly, for now. I'm already converting PDF to TIFF, so changing that value in the process should not be a big deal.
Eric