After a lot of experiments, I still can't get the following script working. I need some guidance on how to diagnose this particular Perl problem. Thanks in advance.

This script tests the Office 2007 OCR API:

use warnings;
use strict;
use Win32::OLE;
use Win32::OLE::Const;

Win32::OLE::Const->Load("Microsoft Office Document Imaging 12\.0 Type Library") 
or 
die "Cannot use the Office 2007 OCR API";
my $miDoc = Win32::OLE->new('MODI.Document') 
or die "Cannot create a MODI object";    
#Loads an existing TIFF file
$miDoc->Create('OCR-test.tif'); 
#Performs OCR with the OCR language set to English
$miDoc->OCR(LangId => 'miLANG_ENGLISH'); 
#Get the OCR result
my $OCRresult = $miDoc->{Images}->Item(0)->{Layout}{Text}; 
print $OCRresult;

I did a small test. I loaded an .MDI file that already contained OCR information, deleted the OCR method line, and ran the script. "print $OCRresult" then gave the expected text output. With the TIFF file and the OCR call in place, however, Perl gives me this warning:

Use of uninitialized value $OCRresult in print at E:\OCR-test.pl line 15

I suspect that something is wrong with the line

$miDoc->OCR(LangId => 'miLANG_ENGLISH'); 

I tried leaving the parens empty and passing three parameters, like 'miLANG_ENGLISH',1,1, but without any luck. I also opened the TIF in Microsoft Office Document Imaging itself to check whether its text could be recognized, and it could.

So what other diagnostic methods do I have?

Or can someone who happens to have Office 2007 test my code with any jpg, bmp or tif picture that contains text and see if something's wrong?

Thanks in advance.

UPDATE

Haha, I've finally figured out where the problem is and how to solve it. @hobbs, thank you for leaving the comment :) Things got interesting. While responding to your comment, I added the link to the Office Document Imaging 2003 VBA Language Reference and took yet another look at the material there. The following information caught my eye:

LangId can be one of the following MiLANGUAGES constants.
miLANG_CHINESE_SIMPLIFIED (2052, &H804)

I changed the following OCR method line:

$miDoc->OCR('miLANG_ENGLISH',1,1);

to this:

$miDoc->OCR(2052,1,1); 
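
The change above suggests that OCR wants the numeric value of a MiLANGUAGES constant, not the constant's name as a string. A minimal sketch of one way to get that number, assuming Win32::OLE::Const->Load returns a hash reference of the type library's constants (the original script discards that return value). The helper lang_id and the %FALLBACK table are hypothetical names for illustration; the fallback values 9 and 2052 are simply the ones used in this post.

```perl
use strict;
use warnings;

# Illustration-only fallback table; these are the two values quoted in this post.
my %FALLBACK = (
    miLANG_ENGLISH            => 9,
    miLANG_CHINESE_SIMPLIFIED => 2052,
);

# Hypothetical helper: map a constant name to the number OCR() expects,
# and fail loudly on a misspelled or unknown name.
sub lang_id {
    my ($consts, $name) = @_;
    die "Unknown MiLANGUAGES constant: $name\n" unless exists $consts->{$name};
    return $consts->{$name};
}

my $consts = \%FALLBACK;
if ($^O eq 'MSWin32') {
    require Win32::OLE::Const;
    # Keep the hash reference returned by Load; fall back to the table
    # above if the type library cannot be found.
    $consts = Win32::OLE::Const->Load('Microsoft Office Document Imaging 12\.0 Type Library')
        || \%FALLBACK;
}

printf "miLANG_ENGLISH = %d\n", lang_id($consts, 'miLANG_ENGLISH');

# With a MODI document object in $miDoc, the call would then look like:
#   $miDoc->OCR(lang_id($consts, 'miLANG_ENGLISH'), 1, 1);
```

Looking the name up in the loaded constants keeps the script readable and turns a typo in the constant name into an immediate error instead of a silent failure.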

A few notes:
1. I'm running ActivePerl 5.10.0 on Windows XP (Chinese version).
2. Before this, I had already tried $miDoc->OCR(9), but without luck.

Suddenly, and kind of magically, that pesky warning "Use of uninitialized value $OCRresult in print at E:\OCR-test.pl line 15" disappeared completely and the OCRed text appeared on the screen. The OCR result was poor, which is no surprise: the parameter 2052 refers to Chinese, and the TIF image contains only English text. So I changed the call to $miDoc->OCR(9,1,1), but this time without luck. Windows threw me this error:

unknown software exception (0x0000000d)

I then switched to a TIF image that contains only Chinese characters and changed the call back to "$miDoc->OCR(2052,1,1);". This time everything worked as expected, and the OCR result was satisfactory.

I now think there is something odd about my Office 2007 OCR installation. Someone running the English version of Windows XP with Office 2007 installed would probably not hit that exception with the parameter

$miDoc->OCR(9,1,1); 

Anyway, I'm really happy that I've finally got things working :D

+3  A: 

For starters I would try dumping the value of $miDoc->{Images} -- does it exist? If it exists and it's a collection does it contain anything? If it contains anything, what is it? An error? Or maybe just a different structure than you're expecting? warn, Dumper, and a little exploration can go a long way.

Incidentally, if you want to do the "modern" thing and don't mind grabbing a nifty tool off of CPAN, try Devel::Dwarn -- it makes dumping to stderr even more fun than it was already :)
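
The exploration suggested above could look roughly like the sketch below. It assumes the file name and object layout from the question, uses the language ID 2052 that worked in the update, and relies on Win32::OLE's Warn option so that a failing OLE call stops the script instead of quietly returning undef. The describe helper is a hypothetical name used only for this illustration.

```perl
use strict;
use warnings;
use Data::Dumper;

# Hypothetical helper: say what a value is before dereferencing it further.
sub describe {
    my ($label, $value) = @_;
    return "$label: undef" unless defined $value;
    return "$label: " . (ref($value) || 'plain scalar');
}

if ($^O eq 'MSWin32') {
    require Win32::OLE;

    # Warn => 3 makes Win32::OLE croak on OLE errors.
    Win32::OLE->Option(Warn => 3);

    my $miDoc = Win32::OLE->new('MODI.Document')
        or die "Cannot create a MODI object: " . Win32::OLE->LastError();

    $miDoc->Create('OCR-test.tif');
    $miDoc->OCR(2052, 1, 1);    # 2052 is the value that worked in the update

    # Walk the object tree one step at a time and report each step to STDERR.
    my $images = $miDoc->{Images};
    warn describe('Images', $images), "\n";
    warn "Image count: $images->{Count}\n";

    my $image = $images->Item(0);
    warn describe('Item(0)', $image), "\n";
    warn Dumper($image->{Layout}{Text});
}
```

Checking each intermediate object separately shows which step first yields undef, which is hard to see when the whole chain is written as one expression.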

hobbs
@hobbs, thanks for the hint. Actually I had done the "print Dumper ($miDoc->{Images});". It gave me $VAR1 = bless( { 'Count' => 1, '_NewEnum' => undef, 'Item' => undef }, 'Win32::OLE' );
Mike
@hobbs, judging from my MDI file test, I think using $miDoc->{Images}->Item(0)->{Layout}{Text} to access the OCRed result is all right. I suspect the OCR method call is not right, but I'm not sure what I'm doing wrong. BTW, I'm using this documentation for reference (http://msdn.microsoft.com/en-us/library/aa202819%28office.11%29.aspx)
Mike
@hobbs, haha, looks like things are getting interesting :) There's indeed something suspicious about the Create method. I'm making progress. I'll update the post soon.
Mike