views: 296
answers: 3

I'm having an issue where I'm corrupting a PDF and I'm not sure of a proper solution. I've seen several posts about people trying to just do a basic stream, or trying to modify the file with a third-party library. Here's how my situation differs...

I have all the web pieces in place to get the PDF streamed back, and it works fine until I try to modify it with C#.

  1. I've manually modified the PDF in a text editor to remove the <> entries and tested that the PDF still functions properly afterward.

  2. I've then programmatically streamed the PDF in as a byte[] from the database, converted it to a string, and used a RegEx to find and remove the same stuff I removed manually.

  3. THE PROBLEM! When I try to convert the modified PDF string contents back into a byte[] to stream back, the encoding no longer seems to be correct (see the sketch after this list). What is the correct encoding?
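Here's a simplified sketch of what I'm currently doing. LoadPdfFromDatabase, documentId, and removePattern are placeholders for my actual data-access code and RegEx, and UTF8 is just the encoding I happened to try (that conversion is the step I suspect is wrong):

byte[] pdfBytes = LoadPdfFromDatabase(documentId);  // placeholder for my DB access

// Convert bytes -> string, strip the entries with the RegEx, convert back.
// The encoding used here is what seems to corrupt the PDF's binary sections.
string pdfText = System.Text.Encoding.UTF8.GetString(pdfBytes);
string modified = System.Text.RegularExpressions.Regex.Replace(pdfText, removePattern, string.Empty);
byte[] outBytes = System.Text.Encoding.UTF8.GetBytes(modified);

Response.ContentType = "application/pdf";
Response.BinaryWrite(outBytes);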

Does anyone know the best way to do something like this? I'm trying to keep my solution as light as possible because our site is geared towards PDF document access, so heavy or complex APIs are not preferable unless there are no other options. Also, because this situation only arises when our users view the file in an iframe for a "preview", I can't permanently modify the PDF.

Thanks for your help in advance!

A: 

Look into iText. There is a reason why things like the Apache Commons libraries exist.

Woot4Moo
Can iText actually parse PDFs to perform a round-trip operation? I thought it was only able to create them.
Lucero
I appreciate your advice and I understand why libraries exist, but when I'm 90% of the way there with one line of RegEx code, I'm not really looking to jump to a library yet. I think the gap is my lack of understanding of how the PDF is encoded (I haven't found good answers), which I need in order to get the encoding right when streaming it back out. Can you offer any advice in that area?
Scott
According to http://www.lowagie.com/iText/, yes, Lucero :) Scott: does the PDF use international character sets? If so, I think you may have to jump to the library, even if you were Adobe :)
Woot4Moo
I think that is likely part of the problem I'm struggling with; encoding/decoding aren't my areas of strength. Our app displays 10,000+ different PDFs, but the audience has never been non-US (not that that really says anything about how the files were created, but I don't know). Unless you have a good thought on how to check that, I'm not sure how I would verify it.
Scott
Sadly, I do not have many ideas in that area.
Woot4Moo
Ok. Again, thanks for your help!
Scott
+2  A: 

You seem to be discovering that...

the PDF format is not trivial!

While it may be OK (if kludgey) to patch a few "text" bytes in-situ (i.e. keeping size and structure unchanged), "messing" much more than that with PDF files typically ends up breaking them. Regular expressions certainly seem like a blunt tool for the job.
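For illustration, a size-preserving patch of the kind meant above might look like the rough sketch below (the class name is hypothetical, and it assumes the pattern is plain ASCII, sits outside any compressed stream, and that whitespace is acceptable where it appears):

static class PdfBytePatcher {
    // Overwrite every occurrence of an ASCII pattern with spaces of the same
    // length, so no byte offsets (xref entries, /Length values, stream starts)
    // are shifted. The file keeps its exact size and structure.
    public static void BlankInPlace(byte[] pdf, string asciiPattern) {
        byte[] pattern = System.Text.Encoding.ASCII.GetBytes(asciiPattern);
        for (int i = 0; i <= pdf.Length - pattern.Length; i++) {
            bool match = true;
            for (int j = 0; j < pattern.Length; j++) {
                if (pdf[i + j] != pattern[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                for (int j = 0; j < pattern.Length; j++) {
                    pdf[i + j] = 0x20; // space
                }
                i += pattern.Length - 1;
            }
        }
    }
}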

The PDF file needs to be parsed and seen as a hierarchical collection of objects (and then some...), and that's why we need libraries which encapsulate the knowledge about the format.

If you need convincing, you may peruse the now-ISO-standard specification for the PDF format (version 1.7), available for free on Adobe's web site. BTW, these 750 pages cover the latest version; while there's much overlap, previous versions introduce yet another layer of details to contend with...

Edit:
This said, in re-reading the question and Lucero's remark, the changes indicated do seem small/safe enough that a "snip and tuck" approach may work.
Beware that this type of approach may lead to issues over time (when the format encountered is a different, older or newer!, version, or when the file content somehow causes different structures to be exposed, or...), or with some specific uses (for example, it may prevent users from using some features of the PDF documents, such as forms or security). Maybe a compromise is to learn enough about the format(s) at hand to confirm that the changes are indeed harmless.

Also... while the PDF format is a relatively complicated affair, the libraries that deal with it are not necessarily heavy, and they are typically easy to use.
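For instance, a do-nothing read/write round-trip with the iTextSharp port (assuming its PdfReader and PdfStamper classes; the wrapper class name here is just for illustration) is only a few lines, and any object-level edits would hook in between opening and closing:

using System.IO;
using iTextSharp.text.pdf;

static class PdfRoundTrip {
    // Parse the PDF into its objects and write it back out, never treating
    // the binary data as text.
    public static byte[] RoundTrip(byte[] input) {
        PdfReader reader = new PdfReader(input);
        using (MemoryStream output = new MemoryStream()) {
            PdfStamper stamper = new PdfStamper(reader, output);
            // ... modify the reader's objects here if needed ...
            stamper.Close();
            reader.Close();
            return output.ToArray();
        }
    }
}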

In short, you'll need to weigh the benefits and drawbacks of both approaches and pick accordingly ;-) (how was that for a "non-answer"?).

mjv
While you are correct, it seems that in his case the change is trivial enough, since the required operation worked when performed via a text editor. Scott is having an encoding problem.
Lucero
Good points, mjv. I will keep a close eye on this in the near future and consider refactoring to a library that makes sense for me after I wrap up the current iteration of my project.
Scott
+1  A: 

Try using the following BinaryEncoding class as the encoding. It basically casts all bytes to chars (and back), so only ASCII data can correctly be processed as a string, but the rest of the data is kept unchanged and nothing is lost as long as you don't introduce any Unicode characters > 0x00FF. So for your round-trip it should work just fine.

using System;
using System.Text;

// Maps each byte to the char with the same value (and back). Bytes 0x00-0xFF
// round-trip unchanged, so binary data survives a byte[] -> string -> byte[]
// conversion as long as no chars above 0x00FF are introduced into the string.
public class BinaryEncoding : Encoding {
    private static readonly BinaryEncoding @default = new BinaryEncoding();

    public static new BinaryEncoding Default {
        get {
            return @default;
        }
    }

    public override int GetByteCount(char[] chars, int index, int count) {
        if (chars == null) {
            throw new ArgumentNullException("chars");
        }
        // One byte per char.
        return count;
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex) {
        if (chars == null) {
            throw new ArgumentNullException("chars");
        }
        if (bytes == null) {
            throw new ArgumentNullException("bytes");
        }
        if (charCount < 0) {
            throw new ArgumentOutOfRangeException("charCount");
        }
        unchecked {
            // Truncating cast: chars above 0x00FF lose information here.
            for (int i = 0; i < charCount; i++) {
                bytes[byteIndex + i] = (byte)chars[charIndex + i];
            }
        }
        return charCount;
    }

    public override int GetCharCount(byte[] bytes, int index, int count) {
        if (bytes == null) {
            throw new ArgumentNullException("bytes");
        }
        // One char per byte.
        return count;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex) {
        if (bytes == null) {
            throw new ArgumentNullException("bytes");
        }
        if (chars == null) {
            throw new ArgumentNullException("chars");
        }
        if (byteCount < 0) {
            throw new ArgumentOutOfRangeException("byteCount");
        }
        unchecked {
            // Widening cast: every byte maps to the char with the same value.
            for (int i = 0; i < byteCount; i++) {
                chars[charIndex + i] = (char)bytes[byteIndex + i];
            }
        }
        return byteCount;
    }

    public override int GetMaxByteCount(int charCount) {
        return charCount;
    }

    public override int GetMaxCharCount(int byteCount) {
        return byteCount;
    }
}
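Usage for your round-trip would then look something like this (removePattern is a placeholder for your actual RegEx):

using System.Text.RegularExpressions;

// Bytes 0x00-0xFF map 1:1 to chars, so everything the RegEx doesn't touch
// survives the string round-trip unchanged.
string pdfText = BinaryEncoding.Default.GetString(pdfBytes);
string modified = Regex.Replace(pdfText, removePattern, string.Empty); // removePattern: placeholder
byte[] outBytes = BinaryEncoding.Default.GetBytes(modified);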
Lucero
I think I may reuse this trick in other contexts with mixed content (when it is not suitable to work directly with a binary file/stream). +1 for the idea! It may even work for the OP's problem, provided the patterns are long enough, there's no UTF-8, no hard offsets, etc. I'm still a bit puzzled at Scott's reluctance to try one of the lighter-weight PDF libraries. G'day!
mjv
I'm not against trying a library at all. I just wanted to pursue an idea that seemed very close to working before abandoning it. At the end of the day, I have a task to complete in X hours and would rather not be behind schedule if I was close. I'm definitely appreciative and open to ideas, but wanted to make sure I was clear about my initial attempt.
Scott
The BinaryEncoding class solved my immediate problem. I do think everyone made a lot of valid points about using libraries and such, so I just want to be clear that if this weren't a very simple modification to the document, I would definitely advise people to use a library as well. Thanks for everyone's guidance!
Scott