i have hundreds of thousands of pages i need to scan and document

+1 Q:

I have many documents that I want to scan. Every document will have about 10 different metadata tags by which I want to be able to search.

So I am thinking of getting a huge scanner and scanning everything in, but then how do I label everything? I guess I will turn them into PDF files and put them in a MySQL DB? What is the best way to do this? I also want to make a GUI to search through this database. I do not want to OCR all the documents; I just want to attach about 10 keywords to every document.

Please suggest a system or procedure for doing this. I want this to be searchable, probably from multiple computers.

What kind of programming is required?

+1 A:

Your best bet may be to look through some open-source document management systems with the features you want, and see how they do it. This kind of functionality is pretty common.

Robert Harvey 2009-11-03 20:07:37

+1 A:

Instead of a database for this, why not leverage the Windows Search functionality, and Windows' ability to assign categories and keywords to files?

Syntax Information: http://www.microsoft.com/windows/products/winfamily/desktopsearch/technicalresources/advquery.mspx

How to use Search in a WPF application: http://www.codeproject.com/KB/WPF/Vista%5FSearch%5Fin%5FWPF.aspx

Edit for additional details: Okay, here's the concept. All these files that you're scanning in are being digitized, and then you'd have some process for naming them and placing data in the metadata for each file (categories, keywords, etc.).

For the service, you'll want to write a Windows Service that contains a thread pool. For every search request that comes in, you would spin off a thread from the pool to perform the actual search. The thread pool keeps the system from being overtaxed and basically manages these threads for you.

The application on the worker computers would make a search request, possibly by writing a message to an MS Message Queue on the server, and then wait for the response from the service (again, possibly via message queues, but you have a number of different options for this communication). When the response comes back, you would update your UI with a list of file names/locations for the user to view, decide on, etc.
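The thread pool idea can be sketched in a few lines. This is only an illustration in Python rather than a Windows Service: the `INDEX` dictionary, the file paths, and the function names `search` and `handle_requests` are all made up for the example, and the message queue is replaced by a plain list of incoming keywords.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory index: file path -> keywords attached at scan time.
# In the setup described above, this lookup would be a Windows Search query.
INDEX = {
    "scans/invoice_0001.pdf": {"invoice", "2009"},
    "scans/invoice_0002.pdf": {"invoice", "2008"},
    "scans/letter_0001.pdf": {"letter", "2009"},
}


def search(keyword):
    """Return the sorted paths whose keyword set contains `keyword`."""
    wanted = keyword.lower()
    return sorted(path for path, tags in INDEX.items() if wanted in tags)


def handle_requests(keywords, max_workers=4):
    """Run each search on a pooled worker thread.

    The pool caps the number of concurrent searches at `max_workers`,
    and the results come back in the same order as the requests.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(search, keywords))
```

Calling `handle_requests(["invoice", "2009"])` returns one list of matching paths per request; a real service would send each list back to the client that asked for it.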

Stephen Wrighton 2009-11-03 20:09:05

this is not a very robust way

I__ 2009-11-03 20:59:40

Define robust? You could easily write a service, running on the server where these files are stored, that performs the actual searches asynchronously. In your process to turn the scanned image into a PDF, assign the keywords, and then you get everything you asked for in the OP. And this uses native systems, meaning you get by without the overhead of an installed RDBMS.

Stephen Wrighton 2009-11-03 21:29:41

plus one for uuuuuuu

I__ 2009-11-03 23:31:18

can you please show me how to do what you are saying

I__ 2009-11-03 23:31:59

+5 A:

I recently helped my wife make digital backups of her 30 years of creative writing. She had about 15,000 pages written in longhand in hundreds of small notebooks.

We tried using a flatbed scanner, but the notebooks don't lie flat, her scanner takes up to 60 seconds to scan a page, and some notebooks were larger and didn't fit on her letter-size flatbed scanner. I know bigger, faster scanners exist, but it was still too clumsy and time-consuming.

We ended up with a digital camera mounted on a small tripod, pointing straight down at a table where the book is lying open. Use the camera's AC adapter, so you can go for hours without needing to change batteries. Some cameras can even be operated from a GUI on the computer, so you don't risk moving the camera by pressing its controls. If you get this all set up conveniently, you can flip pages rapidly and take a photo every few seconds. This solution was a lot faster.

We found it was best to take all the photos for a book and then, as a separate task, offload them to the computer and categorize and archive them, because switching from the camera UI to the cataloging UI for every page would have slowed us down.
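A rough sketch of that offload step, in Python. Everything here is an assumption for illustration: the function name `archive_batch`, the `page_0001.jpg` naming scheme, and the idea that the camera's file names sort in shooting order. It also assumes the notebook's folder is new, since it does not check for existing pages before moving files in.

```python
from pathlib import Path
import shutil


def archive_batch(dump_dir, archive_dir, notebook_id):
    """Move every .jpg in dump_dir into archive_dir/notebook_id.

    Photos are renamed page_0001.jpg, page_0002.jpg, ... in the sorted
    order of their original names (assumed to match shooting order).
    Files that are not .jpg are left where they are.
    """
    target = Path(archive_dir) / notebook_id
    target.mkdir(parents=True, exist_ok=True)

    photos = sorted(
        p for p in Path(dump_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".jpg"
    )

    moved = []
    for page_number, photo in enumerate(photos, start=1):
        destination = target / f"page_{page_number:04d}.jpg"
        shutil.move(str(photo), str(destination))
        moved.append(destination)
    return moved
```

Run it once per notebook after emptying the camera, then do the keyword entry against the archived folder.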

Most people don't bother to store large images in an RDBMS; they just store the image's filename as a string and then add columns for other attributes like title, date, and keywords. The exception is if you need the images to obey ACID transactions and such, which probably doesn't apply in your case.
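As a minimal sketch of that layout: the example below uses Python's built-in `sqlite3` module as a stand-in for MySQL so it runs without a server. The table names, column names, and helper functions are assumptions made up for the example, and keywords are kept in a separate table so each document can carry about 10 of them.

```python
import sqlite3


def open_catalog(db_path=":memory:"):
    """Create the (hypothetical) catalog tables and return the connection."""
    conn = sqlite3.connect(db_path)
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS documents (
            id         INTEGER PRIMARY KEY,
            filename   TEXT NOT NULL UNIQUE,  -- path to the file, not the image
            title      TEXT,
            scanned_on TEXT                   -- ISO date string
        );
        CREATE TABLE IF NOT EXISTS keywords (
            document_id INTEGER NOT NULL REFERENCES documents(id),
            keyword     TEXT NOT NULL,
            PRIMARY KEY (document_id, keyword)
        );
        CREATE INDEX IF NOT EXISTS idx_keywords_keyword ON keywords(keyword);
    """)
    return conn


def add_document(conn, filename, title, scanned_on, keywords):
    """Insert one document row plus its lower-cased, de-duplicated keywords."""
    cur = conn.execute(
        "INSERT INTO documents (filename, title, scanned_on) VALUES (?, ?, ?)",
        (filename, title, scanned_on),
    )
    doc_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO keywords (document_id, keyword) VALUES (?, ?)",
        [(doc_id, k) for k in {k.lower() for k in keywords}],
    )
    conn.commit()
    return doc_id


def find_by_keyword(conn, keyword):
    """Return the filenames of all documents tagged with `keyword`."""
    rows = conn.execute(
        "SELECT d.filename FROM documents d "
        "JOIN keywords k ON k.document_id = d.id "
        "WHERE k.keyword = ? ORDER BY d.filename",
        (keyword.lower(),),
    )
    return [row[0] for row in rows]
```

A search GUI would then only need to call something like `find_by_keyword` and open the returned file paths from a shared folder.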

If you aren't going to do OCR, I can't think of a way to detect keywords automatically. You'll have to enter them manually or choose them from a list. But again, this is best done as a "post-processing" task after you've captured the images.

Bill Karwin 2009-11-03 20:21:19

+1 A:

This project has several aspects that can be addressed separately:

Scanning. Can the sheets be separated and fed through a sheet feeder? If yes, go for a document scanner like the Fujitsu fi-6140 or similar. It works great, up to 3000 pages a day. Still a lot of work, mind you.

If not, go for a camera setup. Look at http://diybookscanner.org/ and similar professional setups.

Expect to invest a minute per 10 to 100 pages, depending on system.
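To get a feel for what those rates mean at this scale, here is a small back-of-the-envelope calculation in Python. The 300,000-page total is an assumed figure standing in for "hundreds of thousands"; the rates are the ones quoted above, and `scanning_estimate` is just a name chosen for the example.

```python
def scanning_estimate(total_pages, slow_rate=10, fast_rate=100, pages_per_day=3000):
    """Return (slow_hours, fast_hours, scanner_days) for a scanning job.

    slow_rate and fast_rate are pages handled per minute of operator
    time; pages_per_day is the document scanner's daily throughput.
    """
    slow_hours = total_pages / slow_rate / 60
    fast_hours = total_pages / fast_rate / 60
    scanner_days = total_pages / pages_per_day
    return slow_hours, fast_hours, scanner_days


# Assumed total: 300,000 pages.
slow_hours, fast_hours, scanner_days = scanning_estimate(300_000)
print(f"Operator time: {fast_hours:.0f} to {slow_hours:.0f} hours")
print(f"Sheet-fed scanner at 3000 pages/day: {scanner_days:.0f} days")
```

Under those assumptions the job comes to roughly 50 to 500 hours of operator time, or about 100 working days on a single sheet-fed scanner.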

OCR. Works fine on printed text. Go for PDF with text under the picture, so you don't have to proofread. That means you see the scanned picture in the PDF, with the OCR'd text superimposed on it. If this document gets printed, it is in effect a photocopy, but you can copy and paste the text from it.

Data storage and Retrieval. The solution for this depends very much on the plans you have for your data.

How many people should access it? If it is just you, a file system solution might be OK. If many, think about a digital library system like DSpace or Greenstone Digital Library.

posipiet 2009-11-04 03:52:42
