I'm parsing a PDF file. I read the data into a byte array and converted it to text, but it doesn't show the full file. I don't want to use any library or third-party software.

        // Read the whole file into memory; the using blocks close the file afterwards.
        byte[] file;
        using (FileStream fs = new FileStream(fname, FileMode.Open, FileAccess.Read))
        using (BinaryReader br = new BinaryReader(fs))
        {
            file = br.ReadBytes((int)fs.Length);
        }

        // Interpret every byte as an ASCII character.
        string text = System.Text.Encoding.ASCII.GetString(file);

        displayFile.Text = text;
+2  A: 

It would really help if you'd give more detail - including some code, preferably a short but complete program that demonstrates the problem.

My guess is that when you're doing the conversion you end up with some text containing a null character ('\0') - which Windows Forms controls treat as a string terminator.

For example, if you use:

    label.Text = "hello\0there";

you'll only see "hello".

Now you may have this problem due to converting from a byte array to text using the wrong encoding - but we can't really help much more with the little information you've provided.
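If the null-character guess is right, one possible workaround is to replace control characters before assigning the text to the control. The following is only a sketch under that assumption: the class name `DisplayText` and the method `MakeVisible` are hypothetical, and the choice of `'.'` as the placeholder is arbitrary.

```csharp
using System;
using System.Text;

public static class DisplayText
{
    // Replaces every control character (including '\0') with '.',
    // but keeps tab, carriage return and line feed so line breaks survive.
    public static string MakeVisible(string text)
    {
        StringBuilder sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            bool keep = c == '\t' || c == '\r' || c == '\n' || !char.IsControl(c);
            sb.Append(keep ? c : '.');
        }
        return sb.ToString();
    }
}
```

With the code from the question, this would be used as `displayFile.Text = DisplayText.MakeVisible(text);`. It only makes the raw bytes visible; it does not turn the PDF into readable text.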

Jon Skeet
+2  A: 

Based on your code example, I would say the problem is that you are assuming the PDF file contains plain ASCII text, which is not the case. PDF is a complicated format, and there are libraries that can parse it for you.

A quick Google search turns up iTextSharp, which can read the PDF format.
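To get a feel for why the raw bytes are not plain text, a file can be probed without any library. The sketch below checks for the `%PDF-` signature a PDF normally begins with, and looks for the filter name `/FlateDecode`, which marks deflate-compressed streams. The class `PdfProbe` and its method names are hypothetical, and finding the filter name is only a hint, not a full parse.

```csharp
using System;
using System.Text;

public static class PdfProbe
{
    // Returns the index of the first occurrence of an ASCII token in the data, or -1.
    private static int Find(byte[] data, string asciiToken)
    {
        byte[] token = Encoding.ASCII.GetBytes(asciiToken);
        for (int i = 0; i + token.Length <= data.Length; i++)
        {
            int j = 0;
            while (j < token.Length && data[i + j] == token[j]) j++;
            if (j == token.Length) return i;
        }
        return -1;
    }

    // True if the data starts with the "%PDF-" signature.
    public static bool HasPdfHeader(byte[] data)
    {
        return Find(data, "%PDF-") == 0;
    }

    // True if the filter name for deflate-compressed streams appears anywhere.
    public static bool MentionsFlateDecode(byte[] data)
    {
        return Find(data, "/FlateDecode") >= 0;
    }
}
```

If `MentionsFlateDecode` returns true for the byte array read in the question, at least part of the page content is compressed and cannot be read by decoding the bytes as characters.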

mlsteeves
+1  A: 

You cannot convert a PDF to text by just interpreting it as ASCII. You may be lucky enough that some of the text actually is ASCII, but you can also expect some of the non-text contents to be indistinguishable from ASCII.

Instead, use one of the existing solutions for parsing PDF. Here is one way using PDFBox and IKVM: Naspinski.net, "Parsing/Reading a PDF file with C# and Asp.Net to text".
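The information lost by the ASCII interpretation can be made concrete. ASCII only covers byte values 0 to 127, and `Encoding.ASCII.GetString` substitutes `?` for anything above that, so binary data is silently altered. The sketch below counts such bytes; the class `AsciiLoss` and its method are hypothetical names used only for illustration.

```csharp
using System;
using System.Text;

public static class AsciiLoss
{
    // Counts the bytes that fall outside the 7-bit ASCII range
    // and would therefore be decoded as '?'.
    public static int CountNonAscii(byte[] data)
    {
        int count = 0;
        foreach (byte b in data)
        {
            if (b > 0x7F) count++;
        }
        return count;
    }
}
```

For example, the bytes `{ 0x25, 0xE2, 0xE3 }` decode to `"%??"` with `Encoding.ASCII`, and `AsciiLoss.CountNonAscii` reports 2 for them. A non-zero count for a PDF means the displayed string no longer matches the file.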

Rasmus Faber
A: 

Even the pure ASCII set contains many non-printable control characters.

Like Jon said, a \0 (NUL) at the beginning of a string terminates everything in .NET. I had a painful experience with this behavior years back. Control characters like 'bell' and 'backspace' will give you funny output. But do not expect to hear a bell ringing :P.

o.k.w
\0 doesn't terminate a string in .NET itself - it terminates it in a Windows Forms control.
Jon Skeet
@Jon: You are right, I had the experience of using log4net within a WinForm and the log output to a file also got terminated by `\0`. I've always 'blamed' .NET, now I know the culprit.
o.k.w
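The point made in the comments can be checked directly: a .NET string keeps everything after a `\0`, and only some consumers of the string stop there. A minimal sketch, with the hypothetical helper `NulInString.AfterFirstNul`:

```csharp
using System;

public static class NulInString
{
    // Returns the text that follows the first NUL character,
    // or an empty string if the input contains no NUL.
    public static string AfterFirstNul(string s)
    {
        int i = s.IndexOf('\0');
        return i < 0 ? string.Empty : s.Substring(i + 1);
    }
}
```

For `"hello\0there"`, `Length` is 11 and `NulInString.AfterFirstNul` returns `"there"`, so the data is still in the string even when a control shows only `"hello"`.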