Comparing two PDF documents that are digitized faxes

+1 Q:

I did a fair bit of looking around on the board before I posted here but I didn't see anything that captured what I was hoping to do.

We receive a large number of inbound faxes (500+ pages/day) as separate documents (around 100+ documents/day). Quite often the sender (being a hospital) resends the same document a couple hours after the first try. I'd like to flag the second send as a "potential clone" so that it can be routed and flagged appropriately.

I want to know how I can compute some sort of hash or ID for each arriving fax (PDF/TIFF), tag the fax with it, and then quickly scan our document DB to see whether the fax is unique.

Obviously there is no way without looking to be 100% sure but off the top of my head I'm thinking that one fax would be the same as another if:

Same # of pages
Sent within 24 hours of original
Hash code is similar (within threshold)

But I am getting a bit bogged down on the image compare. I am looking for a threshold hash code or some way to say "the images on p4 of each fax are 95% likely to be the same". It's possible, for example, that p4 of the original fax was skewed but p4 of the resent fax is straight. I was thinking of running all the fax pages through something like Inlite Research's ClearImage Repair first to straighten, rotate, and calibrate all pages.
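For illustration, the three checks listed above could be combined roughly as follows. This is only a sketch under assumptions: `FaxRecord`, `is_potential_clone`, and the 8-bit per-page threshold are hypothetical, and each page is assumed to already have some integer hash (how to compute it is the open question).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FaxRecord:
    """Hypothetical catalog entry for one received fax."""
    received_at: datetime
    page_hashes: list  # one integer hash per page (assumed to exist already)


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two integer hashes."""
    return bin(a ^ b).count("1")


def is_potential_clone(new, old, max_bits_per_page=8,
                       window=timedelta(hours=24)):
    # Rule 1: same number of pages.
    if len(new.page_hashes) != len(old.page_hashes):
        return False
    # Rule 2: received within the time window of the original.
    if abs(new.received_at - old.received_at) > window:
        return False
    # Rule 3: every page hash is within the bit-difference threshold.
    return all(hamming(a, b) <= max_bits_per_page
               for a, b in zip(new.page_hashes, old.page_hashes))
```

The threshold value is a placeholder and would need tuning against real faxes.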

Has anyone done something like this?

+2 A:

The difficulty is that if the second fax is the result of a new scan, the two files WILL have different hash values.

Measuring similarity (a plausible duplicate) between documents would likely require either OCR-ing them or otherwise comparing, in a fuzzy fashion, their image content (i.e. after decompressing them).

Edit: Suggestions towards a HASH code for duplicate detection

Very tentatively, the following attributes of a document could be combined into a hash value likely to give a good indication of plausible duplication.

These attributes should be obtained for each individual page. The reason is that page boundaries are unambiguous, so by being "hard" on these boundaries we can allow softer (fuzzier) measurements within the page content.
Not all of the following attributes would be necessary. They are listed roughly from the easiest to obtain to the ones that require more programming.

- Characteristics of objects at the level of the PDF (for each page!)
  - size, i.e. number of octets
  - dimensions (width and height; even when the same format, say "letter", is used, the actual scanning results in distinct image sizes)
- OCR text
- Image characteristics (black/white ratio, ...)

With regard to the "hash", it should be as wide as possible, ideally a variable-length hash made by appending, say, 32-bit or 64-bit hashes, one per page.
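A minimal sketch of such a per-page hash, assuming each page has already been decoded into rows of 0 (white) / 1 (black) pixels and is at least 8x8 pixels. The function names and the choice of "one bit per grid cell, set when the cell is darker than the page average" are illustrative assumptions, one possible way to turn the black/white ratio into a 64-bit value:

```python
def page_hash(pixels, grid=8):
    """64-bit hash of one page: one bit per grid cell, set when the cell's
    black ratio is above the average over all cells.
    Assumes the page is at least grid x grid pixels."""
    h, w = len(pixels), len(pixels[0])
    cells = []
    for gy in range(grid):
        for gx in range(grid):
            y0, y1 = gy * h // grid, (gy + 1) * h // grid
            x0, x1 = gx * w // grid, (gx + 1) * w // grid
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))  # black ratio of this cell
    mean = sum(cells) / len(cells)
    bits = 0
    for c in cells:
        bits = (bits << 1) | (1 if c > mean else 0)
    return bits


def document_hash(pages):
    """Variable-length hash: 16 hex digits (64 bits) appended per page."""
    return "".join(f"{page_hash(p):016x}" for p in pages)


def page_distances(hash_a, hash_b):
    """Per-page Hamming distances, or None if the page counts differ."""
    if len(hash_a) != len(hash_b):
        return None
    out = []
    for i in range(0, len(hash_a), 16):
        a = int(hash_a[i:i + 16], 16)
        b = int(hash_b[i:i + 16], 16)
        out.append(bin(a ^ b).count("1"))
    return out
```

Because the page boundaries are kept inside the hash, two documents can be compared page by page, and a small number of differing bits on every page would suggest a plausible duplicate.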

mjv 2009-12-02 23:32:43

I agree, the hashes would almost certainly be different. My request though is for some kind of hashing mechanism that enables me to catalog the first fax so that I can compute a "distance" between it and the second (or third, or fourth) fax. I need some sort of "threshold similarity" measure.

AlanK 2009-12-02 23:35:28

@unknown see edits where I suggest a few criteria that would likely allow creating a successful hash.

mjv 2009-12-02 23:58:10

+1 A:

If the documents are mostly text, OCR-ing them is a good idea. Comparing the text is straightforward.

A "distance" calculation can be done, I suppose, but what if the fax is sent upside-down the second time? Or what if they enlarged it to make it more legible?

I'd try to tackle the subset of documents you're likely to encounter rather than applying a general algorithm. You'll get better results because it won't be looking for everything under the sun.

John at CashCommons 2009-12-02 23:38:35

Unfortunately the docs are a mix of handwritten and computer-generated. Doctors' notes are rarely computer-generated (ask the President!). There is no subsetting; the documents really have a life of their own. Each hospital codes and captures data about patients and so on in very different ways.

AlanK 2009-12-03 04:00:41

+2 A:

If OCR is not an option, you could take an image-based approach. One possibility would be to downsample/filter the fax images (to remove high-frequency noise), then compute the normalized correlation between the pixels of the two downsampled images. Obviously, there are MUCH more robust approaches, but this might be sufficient to flag two faxes for manual inspection, especially if the image repair software you mentioned can automatically orient and scale each page.
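A rough sketch of that idea in plain Python, assuming both pages are already decoded into equally sized lists of pixel rows. Block averaging stands in for the downsample/filter step, and the function names are made up for illustration:

```python
from math import sqrt


def downsample(img, factor):
    """Average non-overlapping factor x factor blocks (a crude low-pass filter)."""
    h, w = len(img) // factor, len(img[0]) // factor
    return [[sum(img[y * factor + dy][x * factor + dx]
                 for dy in range(factor) for dx in range(factor)) / factor ** 2
             for x in range(w)]
            for y in range(h)]


def normalized_correlation(a, b):
    """Pearson correlation of two equally sized images, in [-1, 1].
    Returns 0.0 when either image is perfectly uniform (e.g. a blank page)."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    if len(xs) != len(ys):
        raise ValueError("images must have the same size")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0
```

A value close to 1.0 between the downsampled pages would flag the pair for manual inspection; where exactly to put the cut-off is something to calibrate on real data.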

Donnie DeBoer 2009-12-02 23:55:30

I think the OpenCV library is what you're looking for. If I recall correctly, it has image similarity tools, based on either landmark recognition or frequency-domain techniques. Approximate hashing in the frequency domain is possible and is not thrown off much by small differences between the images.
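To give a feel for frequency-domain hashing, here is a small plain-Python sketch. It does not use OpenCV's API; it is a generic DCT-based hash, assuming the page has already been shrunk to a small grayscale grid of at least 8x8 values, and all names are illustrative:

```python
from math import cos, pi


def dct_1d(v):
    """Unnormalized DCT-II of a sequence."""
    n = len(v)
    return [sum(v[i] * cos(pi * (i + 0.5) * k / n) for i in range(n))
            for k in range(n)]


def dct_2d(img):
    """Separable 2-D DCT: transform the rows, then the columns."""
    rows = [dct_1d(r) for r in img]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]


def dct_hash(img, keep=8):
    """63-bit hash from the keep x keep lowest frequencies, skipping the DC
    term; each bit says whether a coefficient is above the median."""
    coeffs = dct_2d(img)
    low = [coeffs[u][v] for u in range(keep) for v in range(keep)][1:]
    median = sorted(low)[len(low) // 2]
    bits = 0
    for c in low:
        bits = (bits << 1) | (1 if c > median else 0)
    return bits
```

Only the coarse layout of the page feeds into the hash, so small pixel-level differences tend to change few bits; two hashes can then be compared by counting differing bits.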

Rui Ferreira 2009-12-03 01:04:30
