I'm having a bizarre problem with Tesseract. I have a name, "Janice", in a 200x40 pixel TIFF that Tesseract interprets as blank. I'm running hundreds of other names through Tesseract and they are processed fine.

What I'm actually doing, though, is breaking up a larger TIFF into smaller TIFFs of one word each. In the larger TIFF, Tesseract recognizes "Janice".

What could cause it to hiccup on a TIFF that contains only that word (with enough space around the word that none of its pixels are truncated)? I'm using ImageMagick to split the big TIFF. Are there options I should set when writing the new TIFF files?

Running identify -verbose on the file doesn't bring up any obvious indicators:

Here's the identify output for a TIFF that is processed correctly:

Format: TIFF (Tagged Image File Format)
Class: DirectClass
Geometry: 278x30+0+0
Resolution: 72x72
Print size: 3.86111x0.416667
Units: PixelsPerInch
Type: Grayscale
Base type: Grayscale
Endianess: MSB
Colorspace: RGB
Depth: 8-bit
Channel depth:
  gray: 8-bit
Channel statistics:
  gray:
    min: 2 (0.00784314)
    max: 255 (1)
    mean: 200.754 (0.787272)
    standard deviation: 92.7839 (0.363858)
    kurtosis: 0.0583426
    skewness: -1.32866
Rendering intent: Undefined
Interlace: None
Background color: white
Border color: rgb(223,223,223)
Matte color: grey74
Transparent color: black
Page geometry: 278x30+0+0
Dispose: Undefined
Iterations: 0
Compression: None
Orientation: TopLeft
Properties:
  date:create: 2010-04-19T18:22:12-04:00
  date:modify: 2010-04-19T18:22:12-04:00
  signature: ca2dc35810a3e9968157965d68a461b98238e40de4a4b4595a721659591a687f
  tiff:document: ./cells//63/63-row0019col0000.tif
  tiff:photometric: min-is-black
  tiff:rows-per-strip: 29
  tiff:software: ImageMagick 6.5.4-9 2010-04-16 Q8 http://www.imagemagick.org
Artifacts:
  verbose: true
Tainted: False
Filesize: 8.49kb
Number pixels: 8.14kb
Version: ImageMagick 6.5.4-9 2010-04-16 Q8 http://www.imagemagick.org

Here's the one that doesn't:

Format: TIFF (Tagged Image File Format)
Class: DirectClass
Geometry: 278x30+0+0
Resolution: 72x72
Print size: 3.86111x0.416667
Units: PixelsPerInch
Type: Grayscale
Base type: Grayscale
Endianess: MSB
Colorspace: RGB
Depth: 8-bit
Channel depth:
  gray: 8-bit
Channel statistics:
  gray:
    min: 2 (0.00784314)
    max: 255 (1)
    mean: 209.057 (0.81983)
    standard deviation: 86.9067 (0.340811)
    kurtosis: 0.975473
    skewness: -1.62463
Rendering intent: Undefined
Interlace: None
Background color: white
Border color: rgb(223,223,223)
Matte color: grey74
Transparent color: black
Page geometry: 278x30+0+0
Dispose: Undefined
Iterations: 0
Compression: None
Orientation: TopLeft
Properties:
  date:create: 2010-04-19T18:22:12-04:00
  date:modify: 2010-04-19T18:22:12-04:00
  signature: 58ac6a0b2cf027f52d0fa373e725c55eff8c4f9b682c3e2e85adab8513fdbad2
  tiff:document: ./cells//63/63-row0020col0000.tif
  tiff:photometric: min-is-black
  tiff:rows-per-strip: 29
  tiff:software: ImageMagick 6.5.4-9 2010-04-16 Q8 http://www.imagemagick.org
Artifacts:
  verbose: true
Tainted: False
Filesize: 8.49kb
Number pixels: 8.14kb
Version: ImageMagick 6.5.4-9 2010-04-16 Q8 http://www.imagemagick.org
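
To make the comparison between the two listings concrete, a field-by-field diff can be sketched in Python as below. This is only an illustration: `parse_fields` and `diff_fields` are hypothetical helper names, the parser assumes one `key: value` pair per line, and the two sample strings hold just a handful of fields copied from the outputs above rather than the full listings.

```python
def parse_fields(text):
    """Collect 'key: value' lines from identify-style output into a dict.

    Assumes one field per line; section headers such as 'Properties:'
    carry no value and are skipped.
    """
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        # Split on the first ': ' so keys like 'tiff:rows-per-strip' stay whole.
        key, sep, value = line.partition(": ")
        if sep:
            fields[key] = value
    return fields


def diff_fields(a, b):
    """Return {key: (value_in_a, value_in_b)} for every field that differs."""
    keys = set(a) | set(b)
    return {k: (a.get(k), b.get(k)) for k in sorted(keys) if a.get(k) != b.get(k)}


# A few fields from the TIFF that is recognized correctly.
GOOD = """
Geometry: 278x30+0+0
Depth: 8-bit
mean: 200.754 (0.787272)
standard deviation: 92.7839 (0.363858)
kurtosis: 0.0583426
skewness: -1.32866
Compression: None
tiff:rows-per-strip: 29
"""

# The same fields from the TIFF that comes back blank.
BAD = """
Geometry: 278x30+0+0
Depth: 8-bit
mean: 209.057 (0.81983)
standard deviation: 86.9067 (0.340811)
kurtosis: 0.975473
skewness: -1.62463
Compression: None
tiff:rows-per-strip: 29
"""

differences = diff_fields(parse_fields(GOOD), parse_fields(BAD))
for name, (good_value, bad_value) in differences.items():
    print(f"{name}: {good_value} -> {bad_value}")
```

On these samples only the pixel statistics differ. In the full listings the signature and tiff:document values differ as well, but the format, geometry, depth, and compression fields are identical.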