solution to OCR / search through 4 million pieces of paper and 10,000 added daily

tags:

database
ocr

views:

623

answers:

+8 Q:


I work for a medical lab company. They need to be able to search through all their client data. So far they have a few years of records in storage, about 4 million pieces of paper, and they are adding 10,000 pages per day. Data that is 6 months old needs to be accessed about 10-20 times per day. They are deciding whether to spend $80k on a scanning system and have the secretaries scan everything in house, or to hire a company like Iron Mountain to do it. Iron Mountain will charge around 8 cents per page, which adds up to around $300k for the amount of paper we have, plus more money every day for the 10,000 new sheets.

I am thinking that perhaps I can build a database and do all the scanning in house.

What are those systems that are used to scan checks and mail, the ones that read really messy handwriting really well?
Has anyone had experience building a database with a bunch of OCR'd searchable documents? What tools should I use for my problem?
Can you recommend the best OCR libraries?
As a programmer, what would you do to solve this problem?

FYI none of the answers below answer my questions well enough

+3 A:

Update:
Using @eykanal's idea as a starting point:
Examples of the metadata you would store are a document ID, a location for the source image, and something to look up the record by (patient ID, SSN, name, etc.). The 'record locator' data will probably need to be keyed in by data entry clerks looking at the physical form when they scan it.

original:

Not sure about what the check readers are called, but (at least for checks) they are only looking for numbers, so with such a restricted set of characters, they are much more accurate than general OCR is.

One thing to think about:
Take 10 seconds as an approximate per page time to scan.
Then 10,000 * 10 / 60 /60 = ~27.8 hours to scan your daily intake.

That means more than three full-time employees JUST for scanning every day. That may be fine with you and your employer, but I would guess it is cheaper to outsource the scanning. Even three low-paid employees combined, after benefits etc., are going to cost more than $100k per year.
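
The estimate above can be reproduced with a few lines of Python. This is only a sketch: the 10 seconds per page is the rough figure used above, and the 8-hour shift is an assumption, not something stated in the question.

```python
import math

PAGES_PER_DAY = 10_000    # daily intake, from the question
SECONDS_PER_PAGE = 10     # rough per-page scan time used above
HOURS_PER_SHIFT = 8       # assumption: one full-time shift

def scanning_hours(pages: int, seconds_per_page: float) -> float:
    """Total scanning time in hours for a given number of pages."""
    return pages * seconds_per_page / 3600

hours = scanning_hours(PAGES_PER_DAY, SECONDS_PER_PAGE)
# Round up, because a partial shift still needs a person.
operators = math.ceil(hours / HOURS_PER_SHIFT)
print(f"{hours:.1f} scanning hours per day, {operators} full-time operators")
```

Changing SECONDS_PER_PAGE shows how sensitive the staffing estimate is to scanner speed and paper handling time.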

Also:
In my past experience with Xerox document scanners, they produced about 50-100 KB of image data per page, depending on settings and not including the OCR text. Considering you are talking about medical records, you are probably going to need to store the images as well (I can imagine there being legal issues if you don't). That means 200-400 GB for what you have, plus 0.5 to 1 GB per day.
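
The storage figures follow directly from the per-page size. A small sketch, assuming decimal units (1 GB = 1,000,000 KB) and the 50-100 KB per page range quoted above:

```python
KB_PER_GB = 1_000_000   # decimal units, an assumption for round numbers

def storage_gb(pages: int, kb_per_page: float) -> float:
    """Image storage in GB for a number of scanned pages."""
    return pages * kb_per_page / KB_PER_GB

BACKLOG_PAGES = 4_000_000   # existing paper, from the question
DAILY_PAGES = 10_000        # daily intake, from the question

backlog_range = (storage_gb(BACKLOG_PAGES, 50), storage_gb(BACKLOG_PAGES, 100))
daily_range = (storage_gb(DAILY_PAGES, 50), storage_gb(DAILY_PAGES, 100))
print("backlog GB:", backlog_range)
print("daily GB:", daily_range)
```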

BioBuckyBall 2010-07-16 22:30:46

BTW: A decent copier/scanner has a feed tray and can scan at something like 1-2 pages per second. If the people who do the filing now can be trained to use them (make the e-system mimic the old system?), it might even reduce labor.

BCS 2010-07-25 15:09:44

@BCS: that only works if the paper arrives perfect: no staples, in the right order, perfectly flat, consistent humidity!

Stephan Eggermont 2010-07-26 15:17:34

@BCS I'm aware of them. I was building in some time for the inevitable jams, time to get the papers straightened out, getting them to the scanner in the first place etc.

BioBuckyBall 2010-07-26 16:26:14

I'm starting from the assumption that 10k pages are being handled a day by whatever system is in place now. Aside from the issues unique to feeding them into a machine (i.e. about half of that list) someone is doing that anyway. I think the remainder will more than be covered by saved time.

BCS 2010-07-26 17:44:41

+3 A:

There's no way you're going to find OCR software that will read handwriting reliably, especially hand writing you'd describe as 'messy.'

You can spend a lot of money on a scanning system, but that's going to get very costly, very quickly (at least $15k per high-end scanner, plus the cost of the software, training, etc.). And without reliable OCR you'd also have to manually key all the data you want to capture from each document. Obviously this will add to your costs significantly (more software, additional employees, etc.). On top of that, the turnaround time from when new documents are created to when they are available to users may not be acceptable at the daily volume you're talking about.

You'd be better off sending all your documents to a company like Iron Mountain. For the volume you're talking about - and assuming the documents you want scanned/keyed aren't overly complex - I'd be surprised if you couldn't get a better price than $.08 per page.

Such a company can deliver your images and data for import into some sort of document management software, or you can write your own app.

Jay Riggs 2010-07-16 22:31:01

jay, iron mountain is not going to be cheaper than 8c per page

I__ 2010-07-16 22:52:46

@I__ - I think you should shop around. I work for a company that does this and asked around and found we typically charge about half what Iron Mountain quoted you.

Jay Riggs 2010-07-26 15:20:44

@I__ - I'll also add that the data entry portion of this work can be outsourced offshore without violating HIPAA regulations. The key point is to not provide the offshore company with medical record data that includes individually identifiable health information (names, SSN, etc.) on the images sent offshore. This isn't difficult to do.

Jay Riggs 2010-07-26 15:29:38

Jay, what company do you work for? Can you recommend a company that will outsource this? How would I link the data with the patient?

I__ 2010-07-26 15:43:16

@I__ - My email's in my profile; you can contact me through it if you want. You can link the data with the patient by entering the personal data from the record (names, SSNs, etc.) yourself and then redacting it from the scanned records. Before you send a record out for keying the rest of the data, assign the record an ID number. When you get the keyed data back, merge it with the confidential patient data through the ID number.
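
A minimal sketch of that ID-based linking, with hypothetical function and field names (register_record, merge_keyed_data, "record_id" and the sample values are illustrative only). The confidential fields never leave the in-house store; the vendor only ever sees the ID.

```python
import uuid

def register_record(confidential: dict, store: dict) -> str:
    """Keep the confidential fields in-house under a fresh record ID."""
    record_id = uuid.uuid4().hex
    store[record_id] = confidential
    return record_id

def merge_keyed_data(store: dict, keyed_rows: list) -> list:
    """Rejoin vendor-keyed rows with the confidential fields via record_id."""
    return [{**store[row["record_id"]], **row} for row in keyed_rows]

# In-house: key the personal fields and assign the ID before redacting the image.
in_house = {}
rid = register_record({"name": "Jane Doe", "ssn": "000-00-0000"}, in_house)

# Vendor: receives the redacted image plus the ID and returns the other fields.
vendor_rows = [{"record_id": rid, "test": "CBC", "result": "normal"}]

merged = merge_keyed_data(in_house, vendor_rows)
print(merged[0]["name"], merged[0]["test"])
```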

Jay Riggs 2010-07-28 06:42:43

+4 A:

Divide and Conquer!

If you do decide to go down the route of doing this 'in-house', your design needs to have scalability built in from day 1.

This is one rare case in which the task can be broken down and done in parallel.

If you have 10K documents and you build and deploy 10 systems (scanners + servers + custom app), each system only needs to handle around 1K documents.

The challenge would be to make it a cheap and reliable 'turn key solution'.

The application side is probably the easier element. So long as you have a good automated update system designed from the start, you can simply add hardware as you expand your 'farm/cluster'.

Keeping your design modular (i.e. using cheap commodity hardware) will allow you to mix and match or replace hardware on demand without impacting daily throughput.

Start with a trial: build a turnkey solution that can easily sustain 1,000 documents. Once this works flawlessly, scale it up!

Good luck!

Edit 1:

OK, here is a more detailed answer to each of the specific points you raised:

What are those systems that are used to scan checks and mail and they read really messy hand writing really well?

One such system, used by the mail/post delivery company 'TNT' here in the UK, is provided by the Netherlands-based company 'Prime Vision' with their HYCR Engine.

I highly suggest you contact them. Handwriting recognition is never going to be very accurate, whereas OCR on printed characters can sometimes achieve 99% accuracy.

has anyone had experience building a database with a bunch of OCR'd searchable documents? What tools should I use for my problem?

Not specifically OCR'd documents, but for one of our clients I build and maintain a very large and complex EDMS which holds a wide variety of document formats. It is searchable in multiple ways, with a complex set of data access permissions.

In terms of advice, there are two approaches to bear in mind:

Keep the documents on file and store a link in the database.
Store the documents directly in the database as BLOB data.

Each approach has its own set of pros and cons. We opted to go the first route. In terms of searchability, once you have the metadata of the actual documents, it is just a matter of creating custom search queries. I built a rank-based search; it simply gave a higher ranking to the documents that matched more of the tokens. Of course, you could use an off-the-shelf search library such as the Lucene project.
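
As an illustration of that kind of rank-based search, here is a minimal sketch, not the code from the EDMS mentioned above. It assumes each document's metadata is available as a plain string, and the names (rank_search, the sample document IDs) are made up for the example.

```python
def tokenize(text: str) -> set:
    """Lower-case the text and split it on whitespace."""
    return set(text.lower().split())

def rank_search(query: str, documents: dict) -> list:
    """Return document IDs ordered by how many query tokens they match."""
    query_tokens = tokenize(query)
    scored = []
    for doc_id, metadata in documents.items():
        matches = len(query_tokens & tokenize(metadata))
        if matches:
            scored.append((matches, doc_id))
    # Most matches first; ties are broken by document ID for stable output.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]

documents = {
    "doc-1": "smith blood panel 2010",
    "doc-2": "smith urinalysis 2009",
    "doc-3": "jones blood panel 2010",
}
results = rank_search("smith blood 2010", documents)
print(results)
```

Here doc-1 matches all three tokens, doc-3 matches two and doc-2 matches one, so they come back in that order.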

Can you recommend the best OCR libraries?

Yes:

tesseract
tessnet (same as above but for .NET)
OCRopus (Google sponsored)

As a programmer, what would you do to solve this problem?

As described above, please see diagram below. The heart of the system will be your database. You will need a presentation front layer (this could be a web application) to allow clients to search documents in your database. The second part will be the turnkey OCR 'servers'.

For these 'OCR servers' I would simply implement a 'drop folder' (which could be an FTP folder). Your custom application could simply monitor this drop folder (for example with the FileSystemWatcher class in .NET). Files could be sent directly to this FTP folder.

Your custom OCR application would monitor the drop folder and, upon receiving a new file, scan it, generate the metadata and then move it to a 'Scanned' folder. Files that are duplicates or that failed to scan can be moved to their own 'Failed' folder.

The OCR application would then connect to your main Database and do some Inserts or updates (this moves the META DATA to the main database).
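
A rough sketch of one pass of such a drop-folder worker, written in Python rather than .NET and using SQLite as a stand-in for the main database. Everything here is an assumption for illustration: the function name process_drop_folder, the documents table, and the ocr callable, which stands in for whichever OCR engine you choose.

```python
import shutil
import sqlite3
from pathlib import Path
from typing import Callable

def process_drop_folder(drop: Path, scanned: Path, failed: Path,
                        conn: sqlite3.Connection,
                        ocr: Callable[[Path], str]) -> None:
    """Process every file currently in the drop folder once."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS documents "
        "(file_name TEXT PRIMARY KEY, ocr_text TEXT)"
    )
    for path in sorted(p for p in drop.iterdir() if p.is_file()):
        try:
            text = ocr(path)  # hypothetical OCR call supplied by the caller
            with conn:        # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO documents (file_name, ocr_text) VALUES (?, ?)",
                    (path.name, text),
                )
            target = scanned
        except Exception:
            # OCR errors and duplicate file names (primary key clash) end up here.
            target = failed
        shutil.move(str(path), str(target / path.name))
```

A real worker would call this in a loop with a short sleep, or react to file system notifications instead of polling.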

In the background you can have your 'Scanned' folder synchronized with a mirrored folder on your database server (your SQL server, as shown in the diagram). This physically copies your scanned and OCR'd documents to the main server, where the linked records have already been stored.

Anyway, that's how I'd tackle this problem. I've personally implemented one or more of these solutions, so I'm confident this would work and be scalable.

Scalability is key here. For this reason you may want to look at alternatives to the traditional databases.

I would recommend that you at least think about a NoSQL-type database for this project:

[diagram: image not available]

Un-ashamed Plug:

Of course for £40,000 I'd build and set up the whole solution for you (including hardware) !

:) I'm kidding SO users!

EDIT 2:

Note the mention of META DATA; by this I mean the same thing others have alluded to: you should retain the original scan as an image file, along with the OCR'd metadata (so that it allows text searching).

I thought I'd make this clear, in case it was assumed that it was not part of my solution.

Darknight 2010-07-16 23:53:21

+1 A:

OCR-ing doctors' notes can't be easy :D

Try to figure out which of those 4M pages are needed immediately, and hire Iron Mountain for those.

As for the rest, let your client know that you've been given a somewhat unfeasible task, and try to come up with a practical solution -- maybe they can just input a small portion of those papers and rely on statistics?

For the future, if you can format the information into multiple choice, something like Scantron might be an affordable solution.

Rei Miyasaka 2010-07-17 00:14:21

4m pages with iron mountain will be a rip off, around 280k

I__ 2010-07-21 16:01:23

+10 A:

Having worked at a medical office doing data entry, I can say that OCR will almost certainly not work. Our forms had special text boxes, with a separate box for each letter, and even for those the software was correct only about 75% of the time. There were some forms which allowed freeform writing, but the result was universally gibberish.

I would recommend going the meta-data route; scan everything, but instead of trying to OCR each form, just store it as an image and add meta-data tags.

My thinking is this: the goal of OCR in this case is to enable all forms to be read from the computer, thus making data retrieval simpler. However, you don't really need OCR to do that here, all you need to do is find some way which would allow someone to find a form really fast, and get the right info off the form. As such, even if you store each form as an image, adding the right meta-data tags would allow you to retrieve whatever you need whenever you need it, and the person running the search could either read it right off the stored form, or print it and read it that way.

EDIT: One fairly simple way of executing this plan could be to use a simple database schema, where each image is stored as a single field. Each row could then contain something like the following, depending on your needs:

image name
patient ID
date of visit
...

Basically, think of all the ways you'd want to search for a given file, and make sure each one is included as a field. Do you look up patients by Patient ID? Include that. Date of visit? Same. If you aren't familiar with designing a database around search requirements, I suggest hiring a developer with database design skills; you can end up with a very powerful, yet quick, database schema which includes everything you want and is powerful enough for your indexing needs. (Bear in mind that much of this will be highly specific to your application. You'll want to optimize this for your situation, and be sure to set it up as well as you can at the outset.)
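
To make the schema idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (scanned_forms, image_name, patient_id, visit_date) and the sample rows are assumptions chosen to mirror the fields listed above; a real system would pick its own engine and fields.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.executescript("""
CREATE TABLE scanned_forms (
    id          INTEGER PRIMARY KEY,
    image_name  TEXT NOT NULL,
    patient_id  TEXT NOT NULL,
    visit_date  TEXT NOT NULL
);
CREATE INDEX idx_forms_patient ON scanned_forms (patient_id);
CREATE INDEX idx_forms_visit   ON scanned_forms (visit_date);
""")

# Sample rows; in practice these are keyed in at scan time.
conn.executemany(
    "INSERT INTO scanned_forms (image_name, patient_id, visit_date) "
    "VALUES (?, ?, ?)",
    [
        ("scan_0001.tif", "P-1001", "2010-07-01"),
        ("scan_0002.tif", "P-1002", "2010-07-01"),
        ("scan_0003.tif", "P-1001", "2010-07-15"),
    ],
)

def find_forms(conn: sqlite3.Connection, patient_id: str) -> list:
    """Return the image names for one patient, oldest visit first."""
    rows = conn.execute(
        "SELECT image_name FROM scanned_forms "
        "WHERE patient_id = ? ORDER BY visit_date",
        (patient_id,),
    )
    return [name for (name,) in rows]

print(find_forms(conn, "P-1001"))
```

The indexes cover the two lookups mentioned above (patient ID and date of visit); add one for every other field you expect to search by.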

eykanal 2010-07-21 15:37:24

this is a nice solution, so how would i add meta tags and store this info? please be specific

I__ 2010-07-21 16:00:57

addressed in the edit.

eykanal 2010-07-21 21:40:30

The only problem is then associating the image with the meta data's ID field. I could only think of manually doing it at scan time.

BioBuckyBall 2010-07-21 23:05:50

That's true. The only upside is that (I imagine) you'll be scanning patient by patient, so you may be able to save whole folders of images and then set metadata later on batch-style. However, (1) this is still way better than manually fixing all the OCR, and (2) you only do it once.

eykanal 2010-07-22 03:09:24

One additional comment... even if you can find some OCR technology which will capture every character perfectly and store all the text in a database, there will be no semantic ordering to the data you capture. You'll still have to go through and manually figure out what each form is, so that when you search you're not just doing a basic text search through 10+ million documents.

eykanal 2010-07-26 15:52:21

+1 A:

In my opinion the biggest problem is getting the paper into digital form.
Once you have images, I can think of two solutions (or better ideas).

Write an application (not a web app!!!) which shows the images one by one to secretaries. The secretaries tag each image, and a reference to the image and its tags are stored in a database. The UI must be very well designed (no loading time, an auto-guessing feature...) to get as much working speed as possible.
(my favourite) Use OCR software to scan the images and get searchable text. Then implement an application which builds up a tree of the words used in the documents. Each word should have references to the documents it belongs to. Words like (in, the, an, of...) should be excluded from the tree. Then you can search very quickly through the tree and find the documents. If you want to match groups of words, search for every single word and intersect the results. To perform more advanced searches through the whole text, I would recommend a modified DFA version which can process one character of data using only cheap instructions like table lookups (very advanced; I know it because of my interest in compiler design)... It should be possible to scan through the whole text data (at GB level) in acceptable time...
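
The word-to-documents structure in the second idea is essentially an inverted index. A minimal sketch follows; note that it uses a plain dictionary rather than a tree, and the stop-word list and names (build_index, search) are illustrative assumptions.

```python
from collections import defaultdict

STOP_WORDS = {"in", "the", "an", "of", "a", "and"}  # illustrative, not complete

def build_index(documents: dict) -> dict:
    """Map each non-stop word to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            if word not in STOP_WORDS:
                index[word].add(doc_id)
    return index

def search(index: dict, query: str) -> set:
    """Look up every query word and intersect the resulting document sets."""
    words = [w for w in query.lower().split() if w not in STOP_WORDS]
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

documents = {
    1: "blood panel of the patient",
    2: "urine sample and blood count",
    3: "x-ray of the chest",
}
index = build_index(documents)
print(search(index, "blood"))
print(search(index, "blood panel"))
```

Each lookup is a dictionary access, so query time depends on the number of query words and the size of the matching sets, not on the total amount of text.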

These are just suggestions!!!!! I just thought about it... Maybe there is something useful!

youllknow 2010-07-21 19:18:20

+1 A:

The best OCR software I have ever seen in my life is called ABBYY: http://www.abbyy.com/company

I have their software and use it at home for work related projects. It will scan documents, even documents that have graphics, such as logos and checkboxes, etc, and convert the resulting document to either Microsoft Word or PDF. Those are the most common exports. Whatever it comes across that it can't convert to text (like a logo for example), it will simply create a graphic image and place it in the document.

As far as how the post office does this, they use special OCR software (probably ABBYY) that can recognize hand writing: http://en.wikipedia.org/wiki/Remote_Bar_Coding_System

ABBYY also has an SDK, so if you would like to write your own application and integrate OCR into it, you can do that too!

icemanind 2010-07-21 20:52:45

+1 A:

Like others have already suggested, your situation is pretty much a standard ECM (enterprise content management)/archiving problem.

This is usually handled by using a "scanning platform" (depending on volume, the big ones are probably going to be something like EMC² Captiva or Kofax, or they can be done off-site as you have already indicated) to scan the paper documents and store the digital documents in a repository of some sort. This repository is traditionally an ECM platform such as Documentum (EMC²), FileNet (IBM), OpenText, ... These platforms will then offer you all kinds of features to use in conjunction with your digital documents, including full text search. Of course, all of the above has a price tag.

To give you my opinion on your specific questions:

What are those systems that are used to scan checks and mail and they read really messy hand writing really well?

Well any scanning solution will do. I'm not an expert on scanning, but I doubt any of these solutions will yield good results on hand writing.

has anyone had experience building a database with a bunch of OCR'd searchable documents? What tools should I use for my problem?

Nope. But this is what the ECM repositories will handle for you. There are alternatives, most notably Apache Lucene (http://lucene.apache.org) in the Java world.

Can you recommend the best OCR libraries?

As mentioned before, the only OCR library I know of that yields somewhat decent results is ABBYY.

As a programmer, what would you do to solve this problem?

If you do not need ECM, and you're confident that in the future you won't need the extra features provided by an ECM platform, then it's worth looking into building something custom. It's unlikely that this will be easy and straightforward, so you'll have to invest a lot of time designing it, and keep in mind that keeping something like this scalable will be no easy task.

pHk 2010-07-21 21:05:41

+1 A:

Free bootable OCR server: http://www.watchocr.com/

As featured on slashdot: http://linux.slashdot.org/story/10/07/22/1852234/Open-Source-OCR-That-Makes-Searchable-PDFs

Worth a shot at least.

jwsample 2010-07-23 01:03:32

+3 A:

You are currently solving the wrong problem, and 300K is peanuts, as others have already shown. You should focus on eliminating the 10K pages a day you receive now. The other problem just takes a fixed amount of money.

OCR only works reliably for handwriting in very limited domains (recognizing bank numbers, zip codes). The fine results that OCR companies advertise are for printed computer documents in standard formats and standard fonts.

The data entry should not be on paper. Period. Focus on making it so. Push the problem further upfront.

And yes, this is not a programmer problem. It is a management problem.

Stephan Eggermont 2010-07-26 12:35:48
