Here's the basic problem: I have about 10,000 Word documents that contain blocks of data. Each block is numbered and has an accompanying image. I need to store these individual blocks in a database as images (text would be great, but see the note below), without the numbering.
I can have typists go through and mark the beginning and end of each block using ###QUESTIONSTART### and ###QUESTIONEND###, or whatever. The plan is to take each document, convert it to one big image, look for those tags, extract the part between the tags as an image, and then move on to the next block.
I've been looking at some APIs, and I'm fairly sure I can crop the images once I figure out how to get the coordinates of each start/end marker. Any suggestions? I'd hate to write a pixel-by-pixel matcher that has to run in O(number of blocks * n^2).
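To make the idea concrete, here is a rough sketch of the kind of search I'd like to end up with, not a finished solution. It assumes things I haven't verified: that each page is already a 2D list of 0/1 pixels (1 = ink), that each marker sits on its own line starting at the same left margin, and that the marker renders pixel-for-pixel identically every time (a real version would need some tolerance). All function names are made up for illustration. Under those assumptions only one column position is tested per row, so the cost is roughly page height times template size rather than a full 2D scan.

```python
# Sketch only: pages and templates are 2D lists of 0/1 pixels (1 = ink).
# All names here are hypothetical; nothing comes from a real imaging library.

def find_marker_rows(page, template, x_offset=0):
    """Return the top row of every exact match of `template` at column x_offset.

    Assumes each marker starts at the same left margin, so only one
    horizontal position is tested per row instead of every (x, y) pair.
    """
    t_height = len(template)
    t_width = len(template[0])
    hits = []
    y = 0
    while y + t_height <= len(page):
        matched = all(
            page[y + dy][x_offset:x_offset + t_width] == template[dy]
            for dy in range(t_height)
        )
        if matched:
            hits.append(y)
            y += t_height  # skip past the marker that was just found
        else:
            y += 1
    return hits


def block_boxes(page, start_tpl, end_tpl, x_offset=0):
    """Pair each start marker with the next end marker below it.

    Returns (top, bottom) row ranges covering the content between the markers.
    """
    starts = find_marker_rows(page, start_tpl, x_offset)
    ends = find_marker_rows(page, end_tpl, x_offset)
    boxes = []
    for s in starts:
        top = s + len(start_tpl)  # first row below the start marker
        bottom = next((e for e in ends if e >= top), None)
        if bottom is not None:
            boxes.append((top, bottom))
    return boxes


def crop_rows(page, top, bottom):
    """Return the rows between the two markers (markers excluded)."""
    return page[top:bottom]
```

Each (top, bottom) pair could then be handed to whatever imaging API does the actual cropping; the question of how to find the markers robustly in real rendered pages is still open.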
NOTE: These blocks contain complex equations and math-type stuff, hence the images. I don't have the $$ to get 1,000 typists trained in TeX to retype the whole deal, and OCR doesn't cut it yet.