Has anyone had any success with adding additional information to a PDF file?

We have an electronic medical record system which produces medical documents for our users. In the past, those documents have been Print-To-File (.prn) files which we have fed to a system that displayed them as part of an enterprise medical record.

Now the hospital's enterprise medical record vendor wants to receive the documents as PDF, but still wants all of the same information stored in the header.

Honestly, we can't figure out how to put information into a PDF file without breaking it.

Here is the start of one of our PDFs...

%PDF-1.4  
%âãÏÓ  
6 0 obj  
<<  
   /Type /XObject  
   /Subtype /Image  
   /BitsPerComponent 8  
   /Width 854  
   /Height 130  
   /ColorSpace /DeviceRGB  
   /Filter /DCTDecode  
   /Length 17734>>  
stream

In our PRN files, we would insert information like this:

%MRN% TEST000001
%ACCT% TEST0000000000001
%DATE% 01/01/2009^16:44
%DOC_TYPE% Clinical
%DOC_NUM% 192837475
%DOC_VER% 1

My question is, can I insert this information into a PDF in a manner which allows the document server to perform post-processing, yet is NOT visible to the doctor who views the PDF?

Thank you,

David Walker

A: 

This might help. http://www.adobe.com/devnet/acrobat/pdfs/PDF32000_2008.pdf

THEn
+1  A: 

You can still insert comments into a PDF file using the % character. But anyone would be able to read them with a text editor.

Your vendor could remove these comments after post-processing, so they never actually reach the doctors.

eduffy
+6  A: 

Yes, you can. Any line in a PDF file that starts with a percent sign is a comment and is ignored as such (the first two lines of the PDF are actually comments as well). So you can pretty much insert your information into the PDF as you did into the PRN.

However:

The PDF format works with byte position references, so if you insert data into a finished PDF file, this will shift the rest of the data away from its original position and thus break the file. You also cannot simply append it to the file, because a PDF file has to end with

startxref
123456
%%EOF

(the 123456 is an example). You could insert your data right before these three lines. The byte position of the "startxref" part is never referenced anywhere, so you won't break anything if you push this final part towards the end.

Edit: This of course assumes there is no checksumming, signing or encryption going on. That would make things more complicated.

Edit 2: As Javier correctly pointed out, you can also add your data to the end of the file and then add a copy of the three lines after it. It boils down to the same thing, but it's a little easier.
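
A minimal Python sketch of the "append, then repeat the trailer" approach might look like the following. The function name `append_header_comments` and the `%KEY% value` comment layout (borrowed from the PRN header in the question) are illustrative assumptions, and the sketch assumes an unsigned, unencrypted PDF.

```python
import re


def append_header_comments(pdf_bytes: bytes, fields: dict) -> bytes:
    """Append %KEY% value comment lines, then a copy of the startxref trailer.

    Hypothetical helper: assumes the PDF is not signed or encrypted.
    """
    # Locate the last "startxref <offset> %%EOF" group in the file.
    matches = list(re.finditer(rb"startxref\s+(\d+)\s+%%EOF", pdf_bytes))
    if not matches:
        raise ValueError("no startxref/%%EOF trailer found")
    offset = matches[-1].group(1)

    lines = []
    for key, value in fields.items():
        text = f"%{key}% {value}"
        if "\n" in text or "\r" in text:
            # A line break would end the comment and expose the rest as PDF syntax.
            raise ValueError("keys and values must be single-line")
        lines.append(text.encode("ascii"))

    # Make sure the comments start on a fresh line.
    body = pdf_bytes if pdf_bytes.endswith((b"\n", b"\r")) else pdf_bytes + b"\n"
    comments = b"\n".join(lines) + b"\n"

    # Nothing before the appended data moves, so the old offset is still valid.
    return body + comments + b"startxref\n" + offset + b"\n%%EOF\n"
```

Read and write the file in binary mode (`open(path, "rb")` and `open(path, "wb")`) so that no newline or character-set translation alters the existing bytes.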

balpha
This was close. Acrobat PDF reader 9 did not care that I placed my comments after the final %%EOF. Reading and writing a file stream to copy the characters into a new file was changing the actual data. So I opened the file for appending and added the comments to the end of the file, and it worked. Thank you for the information about the byte positions.
David Walker
Adobe's programs are quite relaxed about standards conformance, so they don't mind if there's some garbage at the beginning or end. But note that you don't have a conforming PDF anymore if the last line doesn't read %%EOF, which may or may not matter in your case.
balpha
There used to be a note in the PDF specification that said Adobe PDF products only need the %%EOF to appear somewhere in the final 1024 bytes of the file. Other PDF software may differ.
Bing
+3  A: 

PDFs are designed to hold multiple versions, with each update simply appended at the end; but the very end of the file must contain the offset to the main reference table. Just read the last three lines, append your data and reattach the original ending.

You can either remove the original ending or leave it there. PDF readers will just go to the end and use the second-to-last line to find the reference table.
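
As a rough Python illustration of how a reader might locate that offset, the sketch below scans only the tail of the file and takes the last `startxref` group it finds. The name `last_startxref` and the 1024-byte window are assumptions made for the example; real PDF software may search differently.

```python
import re


def last_startxref(pdf_bytes: bytes, tail_size: int = 1024) -> int:
    """Return the offset from the last startxref/%%EOF group in the file tail.

    Hypothetical helper; tail_size is an assumed search window.
    """
    tail = pdf_bytes[-tail_size:]
    # With one capture group, findall returns just the captured offsets.
    offsets = re.findall(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if not offsets:
        raise ValueError("no startxref/%%EOF found in the file tail")
    # An earlier, superseded ending may still be present; the last one wins.
    return int(offsets[-1])
```

If the original ending is left in place and a copy is added after the appended data, both groups carry the same offset, so either one leads to the same reference table.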

Javier
A: 

At one point we were changing some Acrobat JS code by doing a text replace in a plain (unencrypted) PDF. The trick was that the length of each PDF block was hard-coded in the document, so we could not change the number of characters. We would just add extra spaces.

It worked great; the JS code executed and all.
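
A small Python sketch of that padding trick, assuming the text to replace is stored uncompressed and occurs exactly once; `replace_same_length` is a hypothetical helper name, not part of any PDF library.

```python
def replace_same_length(data: bytes, old: bytes, new: bytes, pad: bytes = b" ") -> bytes:
    """Replace `old` with `new`, padding so the total byte count is unchanged."""
    if len(pad) != 1:
        raise ValueError("pad must be a single byte")
    if len(new) > len(old):
        # A longer replacement would shift every byte offset after it.
        raise ValueError("replacement is longer than the original text")
    if data.count(old) != 1:
        raise ValueError("expected exactly one occurrence of the original text")
    padded = new + pad * (len(old) - len(new))
    return data.replace(old, padded)
```

Because the result has exactly the same length as the input, the stream lengths and cross-reference offsets stored in the file stay valid.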

Tom Hubbard
A: 

Have you thought about using XMP?

Tim
+1  A: 

You can store the data as real PDF metadata. For example, with CAM::PDF you can write metadata like this:

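use strict;
use warnings;
use CAM::PDF;

# Open the source PDF and fetch the Info dictionary from the trailer.
my $pdf = CAM::PDF->new('temp.pdf') || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info}) || die;

# Add a custom PRN dictionary holding the document fields.
$info->{PRN} = CAM::PDF::Node->new('dictionary', {
   DOC_TYPE => CAM::PDF::Node->new('string', 'Clinical'),
   DOC_NUM  => CAM::PDF::Node->new('number', 192837475),
   DOC_VER  => CAM::PDF::Node->new('number', 1),
});

# Write the modified document to a new file.
$pdf->cleanoutput('out.pdf');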

The Info node of the PDF then looks like this:

8 0 obj
<< /CreationDate (D:20080916083455-04'00')
/ModDate (D:20080916083729-04'00')
/PRN << /DOC_NUM 192837475 /DOC_TYPE (Clinical) /DOC_VER 1 >> >>
endobj

You can read the PRN data back out like so (simplistic code...)

use strict;
use warnings;
use CAM::PDF;

my $pdf = CAM::PDF->new('out.pdf') || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info}) || die;
my $prn = $info->{PRN};
if ($prn) {
   # Dereference the PRN dictionary and print each key/value pair.
   my $prndict = $pdf->getValue($prn);
   for my $key (sort keys %{$prndict}) {
      print "$key = ", $pdf->getValue($prndict->{$key}), "\n";
   }
}

This produces output like this:

DOC_NUM = 192837475
DOC_TYPE = Clinical
DOC_VER = 1

PDF supports arbitrarily nested arrays, dictionaries and references so just about any data can be represented. For example, I built an entire filesystem embedded in a PDF just for fun!

Chris Dolan
A: 

Have you ever thought to embed your additional info inside the PDF as a separate file?

The generic PDF specification allows you to "attach files" to PDFs. Attached files can be anything: *.txt, *.doc, *.xsl, *.html or even *.pdf. Attached files are contained in the PDF "container" file without corrupting the container's own content. (Special-purpose PDF specifications such as PDF/A-* and PDF/X-* may impose some restrictions on embedded/attached files.)

That lets you tie additional info and/or data to PDF files so that they can be stored and processed together. Attached files are not supposed to disturb any PDF viewer's rendering.

I've used that feature frequently, for various purposes:

  • store the parent document (like .doc) from which the .pdf was created inside the .pdf itself;
  • attach job ticket information to a print file that is sent to the print shop;
  • etc.

Of course, recently discovered and published flaws in PDF processing software (and in the PDF spec itself) are a reason to stay away from embedding/attaching binary files to PDF files, because more and more readers will by default stop you from easily extracting/detaching the embedded/attached files.

However, there is no reason why you shouldn't be able to put your additional info into a medical-record-info.txt file of arbitrary length and internal format and attach it to the PDF:

 MRN TEST000001
 ACCT TEST0000000000001
 DATE 2009-01-01
 TIME 16:44:33.76
 DOC_TYPE Clinical
 DOC_NUM 192837475
 DOC_VER 1
 MORE_INFO blah blah

 Hi, guys,
     can you please process this file faster than usual? If you don't,
     someone will be dying.
 Seriously, David. 

FWIW, the command-line tools pdftk.exe (Windows) and pdftk (Linux) are able to attach files to, and detach embedded files from, their container PDF. Acrobat Reader can also handle attachments.

You could set up or script the document server that handles the PDF to automatically detach the embedded .txt file and trigger actions according to its content.
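
Once the attachment has been detached (for example with pdftk's unpack_files operation), the server side could parse it with something like the Python sketch below. The function name `parse_record_info` and the rule that the first blank line ends the key/value header are assumptions based on the sample file above.

```python
def parse_record_info(text: str) -> dict:
    """Parse leading "KEY value" lines; stop at the first blank line.

    Hypothetical parser for the sample medical-record-info.txt layout.
    """
    fields = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line:
            break  # everything after the blank line is free-form text
        # Split on the first space only, so values may contain spaces.
        key, _, value = line.partition(" ")
        fields[key] = value.strip()
    return fields
```

All values are returned as strings; converting dates or version numbers is left to whatever post-processing the document server performs.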

Of course, the doctor who views the PDF would be able to see that there is a file attachment in the PDF. But it wouldn't appear in his "normal" viewing. He'd have to take specific additional actions in order to extract and view it. (And then there is the option to set a password on the PDF to protect it from unauthorized file detachments, and/or to encode, obscure or rot13 the .txt. These are not exactly rock-solid methods, but 99% of doctors wouldn't be able to get past them even if you taught them how to...)

pipitas