How do I extract sections (multiple sections per page, multiple pages) of a Word document/PDF/image as separate images/Word documents/PDFs?

Q:

Here's the basic problem: I have about 10,000 Word documents that contain blocks of data. Each block is numbered and also has an accompanying image. I need to store these individual blocks in a database as images (text would be great, but read the note below), without the numbering.

I can have typists go through and mark the beginning and end of each block with tags such as ###QUESTIONSTART### and ###QUESTIONEND###. My plan is to take that document, convert it to one big image, look for those tags, extract the part between each pair of tags as an image, and then move on to the next block.
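For any blocks whose content does survive as plain text, the tag idea can be sketched directly on the text. This is only a sketch under assumptions: the document text has already been exported to a string, and the numbering looks like "12." or "12)" at the start of a block. The names `extract_blocks`, `BLOCK_RE` and `NUMBER_RE` are made up for illustration.

```python
import re

START = "###QUESTIONSTART###"
END = "###QUESTIONEND###"

# Non-greedy match of everything between a start tag and the next end tag.
BLOCK_RE = re.compile(re.escape(START) + r"(.*?)" + re.escape(END), re.DOTALL)

# Assumed numbering style: digits followed by "." or ")" at the start of a block.
NUMBER_RE = re.compile(r"^\s*\d+\s*[.)]\s*")


def extract_blocks(text):
    """Return the text of each tagged block with its leading number removed."""
    blocks = []
    for match in BLOCK_RE.finditer(text):
        body = NUMBER_RE.sub("", match.group(1), count=1)
        blocks.append(body.strip())
    return blocks
```

This does nothing for the equation images themselves; it only shows that the tags make block boundaries easy to find once the content is searchable.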

I've been looking at some APIs, and I'm confident I can crop the images once I know the coordinates of each start/end marker. The open problem is finding those coordinates. Any suggestions? I'd hate to write a pixel-by-pixel matcher that runs in O(number of blocks * n^2).
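One way to avoid a pixel-by-pixel template matcher is to swap the text tags for something that is trivial to detect in the rendered page. The sketch below assumes, purely for illustration, that each marker is a solid dark horizontal rule spanning the page width, and that a page is a list of rows of grayscale values (0 is black, 255 is white). It is a row-scan over the page, not the tag matching described above, and the names `find_marker_bands` and `crop_blocks` are hypothetical.

```python
def dark_fraction(row, threshold=128):
    """Fraction of pixels in a (non-empty) row darker than the threshold."""
    return sum(1 for pixel in row if pixel < threshold) / len(row)


def find_marker_bands(image, min_dark=0.9):
    """Return (top, bottom) ranges of consecutive mostly-dark rows.

    bottom is exclusive. Each row is visited once, so the cost is
    proportional to the number of pixels on the page.
    """
    bands = []
    start = None
    for y, row in enumerate(image):
        is_marker = dark_fraction(row) >= min_dark
        if is_marker and start is None:
            start = y
        elif not is_marker and start is not None:
            bands.append((start, y))
            start = None
    if start is not None:
        bands.append((start, len(image)))
    return bands


def crop_blocks(image):
    """Treat bands as alternating start/end markers and crop between them."""
    bands = find_marker_bands(image)
    blocks = []
    for i in range(0, len(bands) - 1, 2):
        top = bands[i][1]         # first row after the start marker
        bottom = bands[i + 1][0]  # first row of the end marker
        blocks.append(image[top:bottom])
    return blocks
```

The row ranges returned by `find_marker_bands` are exactly the crop coordinates an imaging API would need, so the same idea carries over to a real image library.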

NOTE: These blocks contain complex equations and other math-type content, hence the images. I don't have the $$ to get 1000 typists trained in TeX to retype the whole lot, and OCR doesn't cut it yet.

A:

I don't fully understand your question, but my impression is that Tika can help you.

Istao 2010-06-30 10:44:29

Tika currently only does text and MIME-type parsing. I'm not sure I could use it to produce images.

kdawg 2010-06-30 11:29:04

A:

If you can have typists add block marks to 10,000 documents, why can't the typists

Open the Word document
Copy the image from the Word document
Paste the image into Paint
Save the image to their disk?

You can come up with an image naming scheme that makes sense to you and your typists.

Then you can collect the images from the disk drives with a program and load them into your database.
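As a rough sketch of that last step, the following collects images from a folder and stores them in a database. Everything specific here is an assumption for illustration: the naming scheme `doc<number>_block<number>.png`, the SQLite database, the `blocks` table layout, and the function name `load_images`.

```python
import re
import sqlite3
from pathlib import Path

# Hypothetical naming scheme: doc<document number>_block<block number>.png
NAME_RE = re.compile(r"^doc(\d+)_block(\d+)\.png$")


def load_images(folder, db_path):
    """Store every correctly named image in the folder; return how many were loaded."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blocks ("
        "doc_id INTEGER, block_id INTEGER, image BLOB, "
        "PRIMARY KEY (doc_id, block_id))"
    )
    loaded = 0
    for path in sorted(Path(folder).iterdir()):
        match = NAME_RE.match(path.name)
        if match is None:
            continue  # skip files that do not follow the naming scheme
        conn.execute(
            "INSERT OR REPLACE INTO blocks VALUES (?, ?, ?)",
            (int(match.group(1)), int(match.group(2)), path.read_bytes()),
        )
        loaded += 1
    conn.commit()
    conn.close()
    return loaded
```

Keeping the document and block numbers in the file name means the typists never touch the database, and a rerun simply overwrites rows that already exist.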

Gilbert Le Blanc 2010-06-30 16:00:34
