views: 296
answers: 3

I'm having an issue where I'm corrupting a PDF and I'm not sure of a proper solution. I've seen several posts about people trying to just do a basic stream, or trying to modify the file with a third-party library. Here's how my situation differs...

I have all the web pieces in place to get the PDF streamed back, and it works fine until I try to modify it with C#.

  1. I've manually modified the PDF in a text editor to remove the <> entries and tested that the PDF still functions properly afterward.

  2. I've then programmatically streamed the PDF in as a byte[] from the database, converted it to a string, and used a RegEx to find and remove the same stuff I removed manually.

  3. THE PROBLEM! When I try to convert the modified PDF string contents back into a byte[] to stream back, the encoding no longer seems to be correct (see the sketch after this list). What is the correct encoding?
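Here's a simplified sketch of what I'm currently doing. LoadPdfFromDatabase, documentId, and removePattern are placeholders for my actual data-access code and RegEx, and UTF8 is just the encoding I happened to try (that conversion is the step I suspect is wrong):

byte[] pdfBytes = LoadPdfFromDatabase(documentId);  // placeholder for my DB access

// Convert bytes -> string, strip the entries with the RegEx, convert back.
// The encoding used here is what seems to corrupt the PDF's binary sections.
string pdfText = System.Text.Encoding.UTF8.GetString(pdfBytes);
string modified = System.Text.RegularExpressions.Regex.Replace(pdfText, removePattern, string.Empty);
byte[] outBytes = System.Text.Encoding.UTF8.GetBytes(modified);

Response.ContentType = "application/pdf";
Response.BinaryWrite(outBytes);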

Does anyone know the best way to do something like this? I'm trying to keep my solution as light as possible because our site is geared towards PDF document access, so heavy or complex APIs are not preferable unless there are no other options. Also, because this situation only arises when our users view the file in an iframe for a "preview", I can't permanently modify the PDF.

Thanks for your help in advance!

A: 

Look into iText. There is a reason why things like the Apache Commons libraries exist.

Woot4Moo
Can iText actually parse PDFs to perform a round-trip operation? I thought it was only able to create them.
Lucero
I appreciate your advice and I understand why libraries exist, but when I'm 90% of the way there with one line of RegEx code, I'm not really looking to jump to a library yet. I think the gap is my lack of understanding of how the PDF is encoded (I haven't found good answers), which I need in order to get the encoding right when streaming it back out. Can you offer any advice in that area?
Scott
According to http://www.lowagie.com/iText/, yes, Lucero :) Scott: does the PDF use international character sets? If so, I think you may have to jump to the library, even if you were Adobe :)
Woot4Moo
I think that is likely part of the problem I'm struggling with; encoding/decoding aren't my areas of strength. Our app displays 10,000+ different PDFs, but the audience has never been non-US (not that that really says anything about how the files were created, but I don't know). Unless you have a good thought on how to check that, I'm not sure how I would verify it.
Scott
Sadly, I do not have many ideas in that area.
Woot4Moo
Ok. Again, thanks for your help!
Scott
+2  A: 

You seem to be discovering that...

the PDF format is not trivial!

While it may be OK (if kludgey) to patch a few "text" bytes in-situ (i.e. keeping size and structure unchanged), "messing" much more than that with PDF files typically ends up breaking them. Regular expressions certainly seem like a blunt tool for the job.
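For illustration, a size-preserving patch of the kind meant above might look like the rough sketch below (the class name is hypothetical, and it assumes the pattern is plain ASCII, sits outside any compressed stream, and that whitespace is acceptable where it appears):

static class PdfBytePatcher {
    // Overwrite every occurrence of an ASCII pattern with spaces of the same
    // length, so no byte offsets (xref entries, /Length values, stream starts)
    // are shifted. The file keeps its exact size and structure.
    public static void BlankInPlace(byte[] pdf, string asciiPattern) {
        byte[] pattern = System.Text.Encoding.ASCII.GetBytes(asciiPattern);
        for (int i = 0; i <= pdf.Length - pattern.Length; i++) {
            bool match = true;
            for (int j = 0; j < pattern.Length; j++) {
                if (pdf[i + j] != pattern[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                for (int j = 0; j < pattern.Length; j++) {
                    pdf[i + j] = 0x20; // space
                }
                i += pattern.Length - 1;
            }
        }
    }
}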

The PDF file needs to be parsed and seen as a hierarchical collection of objects (and then some...), and that's why we need libraries which encapsulate the knowledge about the format.

If you need convincing, you may peruse the now-ISO-standard specification for the PDF format (version 1.7), available for free on Adobe's web site. BTW, these 750 pages cover the latest version; while there's much overlap, previous versions introduce yet another layer of details to contend with...

Edit:
This said, in re-reading the question and Lucero's remark, the changes indicated do seem small/safe enough that a "snip and tuck" approach may work.
Beware that this type of approach may lead to issues over time (when the format encountered is a different, older or newer!, version, or when the file content somehow causes different structures to be exposed, or...), or with some specific uses (for example, it may prevent users from using some features of the PDF documents, such as forms or security). Maybe a compromise is to learn enough about the format(s) at hand to confirm that the changes are indeed harmless.

Also... while the PDF format is a relatively complicated affair, the libraries that deal with it are not necessarily heavy, and they are typically easy to use.
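For instance, a do-nothing read/write round-trip with the iTextSharp port (assuming its PdfReader and PdfStamper classes; the wrapper class name here is just for illustration) is only a few lines, and any object-level edits would hook in between opening and closing:

using System.IO;
using iTextSharp.text.pdf;

static class PdfRoundTrip {
    // Parse the PDF into its objects and write it back out, never treating
    // the binary data as text.
    public static byte[] RoundTrip(byte[] input) {
        PdfReader reader = new PdfReader(input);
        using (MemoryStream output = new MemoryStream()) {
            PdfStamper stamper = new PdfStamper(reader, output);
            // ... modify the reader's objects here if needed ...
            stamper.Close();
            reader.Close();
            return output.ToArray();
        }
    }
}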

In short, you'll need to weigh the benefits and drawbacks of both approaches and pick accordingly ;-) (how was that for a "non-answer"?).

mjv
While you are correct, it seems that in his case the change is trivial enough, since the required operation worked when performed via a text editor. Scott is having an encoding problem.
Lucero
Good points, mjv. I will keep a close eye on this in the near future and consider refactoring to a library that makes sense for me after I wrap up the current iteration of my project.
Scott
+1  A: 

Try using the following BinaryEncoding class as the encoding. It basically casts all bytes to chars (and back), so only ASCII data can correctly be processed as a string, but the rest of the data is kept unchanged and nothing is lost as long as you don't introduce any Unicode characters > 0x00FF. So for your round-trip it should work just fine.

using System;
using System.Text;

// Maps each byte to the char with the same value (and back). Bytes 0x00-0xFF
// round-trip unchanged, so binary data survives a byte[] -> string -> byte[]
// conversion as long as no chars above 0x00FF are introduced into the string.
public class BinaryEncoding : Encoding {
    private static readonly BinaryEncoding @default = new BinaryEncoding();

    public static new BinaryEncoding Default {
        get {
            return @default;
        }
    }

    public override int GetByteCount(char[] chars, int index, int count) {
        if (chars == null) {
            throw new ArgumentNullException("chars");
        }
        // One byte per char.
        return count;
    }

    public override int GetBytes(char[] chars, int charIndex, int charCount, byte[] bytes, int byteIndex) {
        if (chars == null) {
            throw new ArgumentNullException("chars");
        }
        if (bytes == null) {
            throw new ArgumentNullException("bytes");
        }
        if (charCount < 0) {
            throw new ArgumentOutOfRangeException("charCount");
        }
        unchecked {
            // Truncating cast: chars above 0x00FF lose information here.
            for (int i = 0; i < charCount; i++) {
                bytes[byteIndex + i] = (byte)chars[charIndex + i];
            }
        }
        return charCount;
    }

    public override int GetCharCount(byte[] bytes, int index, int count) {
        if (bytes == null) {
            throw new ArgumentNullException("bytes");
        }
        // One char per byte.
        return count;
    }

    public override int GetChars(byte[] bytes, int byteIndex, int byteCount, char[] chars, int charIndex) {
        if (bytes == null) {
            throw new ArgumentNullException("bytes");
        }
        if (chars == null) {
            throw new ArgumentNullException("chars");
        }
        if (byteCount < 0) {
            throw new ArgumentOutOfRangeException("byteCount");
        }
        unchecked {
            // Widening cast: every byte maps to the char with the same value.
            for (int i = 0; i < byteCount; i++) {
                chars[charIndex + i] = (char)bytes[byteIndex + i];
            }
        }
        return byteCount;
    }

    public override int GetMaxByteCount(int charCount) {
        return charCount;
    }

    public override int GetMaxCharCount(int byteCount) {
        return byteCount;
    }
}
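Usage for your round-trip would then look something like this (removePattern is a placeholder for your actual RegEx):

using System.Text.RegularExpressions;

// Bytes 0x00-0xFF map 1:1 to chars, so everything the RegEx doesn't touch
// survives the string round-trip unchanged.
string pdfText = BinaryEncoding.Default.GetString(pdfBytes);
string modified = Regex.Replace(pdfText, removePattern, string.Empty); // removePattern: placeholder
byte[] outBytes = BinaryEncoding.Default.GetBytes(modified);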
Lucero
I think I may reuse this trick in other contexts with mixed content (when it is not suitable to work directly with a binary file/stream). +1 for the idea! It may even work for the OP's problem, provided the patterns are long enough, there's no UTF-8, no hard offsets, etc. I'm still a bit puzzled at Scott's reluctance to try one of the lighter-weight PDF libraries. G'day!
mjv
I'm not against trying a library at all. I just wanted to pursue an idea that seemed very close to working before abandoning it. At the end of the day, I have a task to complete in X hours and would rather not be behind schedule if I was close. I'm definitely appreciative and open to ideas, but wanted to make sure I was clear about my initial attempt.
Scott
The BinaryEncoding class solved my immediate problem. I do think everyone made a lot of valid points about using libraries and such, so I just want to be clear that if this weren't a very simple modification to the document, I would definitely advise people to use a library as well. Thanks for everyone's guidance!
Scott