I did a fair bit of looking around on the board before posting here, but I didn't see anything that covered what I'm hoping to do.
We receive a large number of inbound faxes (500+ pages/day) as separate documents (100+ documents/day). Quite often the sender (a hospital) resends the same document a couple of hours after the first try. I'd like to flag the second send as a "potential clone" so that it can be routed appropriately.
I want to know how I can compute some sort of hash or ID for each arriving fax (PDF/TIFF), tag the fax with it, and then quickly scan our document DB to see whether the fax is unique.
Obviously there is no way to be 100% sure without looking, but off the top of my head I'm thinking that one fax would be the same as another if:
- Same # of pages
- Sent within 24 hours of original
- Hash code is similar (within threshold)
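To make those three checks concrete, here is a rough Python sketch of what I have in mind. It assumes every page already has a 64-bit hash stored with the fax, which is exactly the part I don't have yet. The `Fax` record, the `is_potential_clone` function, and the 6-bit tolerance are all made up for illustration, not taken from any product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Fax:
    """Hypothetical record for one received fax."""
    fax_id: str
    received_at: datetime
    page_hashes: List[int]  # one 64-bit hash per page (assumed to exist)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_potential_clone(new: Fax, old: Fax,
                       window: timedelta = timedelta(hours=24),
                       max_bits: int = 6) -> bool:
    """Apply the three checks: page count, time window, per-page hash distance."""
    if len(new.page_hashes) != len(old.page_hashes):
        return False
    if abs(new.received_at - old.received_at) > window:
        return False
    # Every page must be within the bit tolerance of its counterpart.
    return all(hamming(a, b) <= max_bits
               for a, b in zip(new.page_hashes, old.page_hashes))
```

The page-count and time-window checks are cheap, so they could run as a DB filter first and leave the hash comparison for the handful of candidates that remain.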
But I am getting a bit bogged down on the image compare. I am looking for a threshold hash code or some way to say "the images on p4 of each fax are 95% likely to be the same". It's possible, for example, that p4 of the original fax was skewed but p4 of the resent fax is straight. I was thinking of running all the fax pages through something like Inlite Research's ClearImage Repair first to straighten, rotate, and calibrate all pages.
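The kind of "threshold hash" I am picturing is something like an average hash: shrink the page to an 8x8 grid of brightness values, set one bit per cell depending on whether it is brighter than the grid mean, and compare two pages by the fraction of matching bits. Below is a minimal pure-Python sketch. It assumes the page has already been decoded into a 2D list of grayscale values (0-255) by some imaging library, which is not shown, and the function names are my own. The similarity figure is a fraction of matching bits, not a true probability, and as far as I understand this kind of hash tolerates small noise but not much rotation, so deskewing first would still matter.

```python
from typing import List


def average_hash(pixels: List[List[int]], size: int = 8) -> int:
    """Return a size*size-bit hash of a grayscale page (rows of 0-255 values)."""
    height, width = len(pixels), len(pixels[0])
    if height < size or width < size:
        raise ValueError("image is smaller than the hash grid")
    cells = []
    for r in range(size):
        r0, r1 = r * height // size, (r + 1) * height // size
        for c in range(size):
            c0, c1 = c * width // size, (c + 1) * width // size
            # Mean brightness of the block of pixels that falls in this cell.
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def similarity(hash_a: int, hash_b: int, nbits: int = 64) -> float:
    """Fraction of bits on which two hashes agree (1.0 means identical)."""
    return 1.0 - bin(hash_a ^ hash_b).count("1") / nbits
```

With this, "95% likely to be the same" would turn into something like `similarity(h1, h2) >= 0.95` for each pair of pages, with the cutoff tuned against real faxes.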
Has anyone done something like this?