Hi, I can't read a file, and I don't understand why:

f = open("test/test.pdf", "r")
data = list(f.read())
print data

Returns : []

I would like to open a PDF, extract every byte, and put them in a list.

What's wrong with my code? :(

Thanks,

+4  A: 
f = open("test/test.pdf", "rb")

You must include the pseudo-mode "b" for binary when reading and writing on Windows. Otherwise the OS silently translates what it considers to be "line endings", causing i/o corruption.
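A minimal sketch of reading every byte into a list with "rb". The helper name read_all_bytes and the temporary sample file are made up for illustration; the sample stands in for test/test.pdf. Wrapping the result in bytearray gives integer byte values on both Python 2 and Python 3.

```python
import os
import tempfile


def read_all_bytes(path):
    """Return the file's contents as a list of integer byte values."""
    f = open(path, "rb")  # "b" turns off line-ending translation
    try:
        return list(bytearray(f.read()))
    finally:
        f.close()


# Hypothetical sample file; it only imitates the first bytes of a PDF.
fd, sample_path = tempfile.mkstemp(suffix=".pdf")
os.write(fd, b"%PDF-1.5\r\n%\xe2\xe3\xcf\xd3\r\n")
os.close(fd)

data = read_all_bytes(sample_path)
os.remove(sample_path)

print(data[:5])  # byte values of "%PDF-"
```

Both "\r\n" pairs written to the sample come back untouched, which is the point of the "b".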

Jonathan Feinberg
+1  A: 

Jonathan is correct that you should open the file in binary mode if you are on Windows.

However, a PDF file will start with "%PDF-", which would at least be read in regardless of whether you are using binary mode or not.

So it appears to me that your "test/test.pdf" is an empty file.
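One way to check this, sketched under the assumption that a valid PDF begins with "%PDF-" at offset 0. The names describe_file and make_temp_file are hypothetical, and the two temporary files stand in for an empty and a non-empty test/test.pdf.

```python
import os
import tempfile

PDF_MAGIC = b"%PDF-"


def describe_file(path):
    """Classify a file as 'empty', 'pdf', or 'not a pdf' by size and header."""
    if os.path.getsize(path) == 0:
        return "empty"
    with open(path, "rb") as f:
        header = f.read(len(PDF_MAGIC))
    return "pdf" if header == PDF_MAGIC else "not a pdf"


def make_temp_file(content):
    """Write content to a new temporary file and return its path."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    os.write(fd, content)
    os.close(fd)
    return path


# Hypothetical stand-ins for an empty file and a real PDF.
empty_path = make_temp_file(b"")
pdf_path = make_temp_file(b"%PDF-1.5\r\n")

results = [describe_file(empty_path), describe_file(pdf_path)]
print(results)

os.remove(empty_path)
os.remove(pdf_path)
```

If describe_file reports "empty" for your path, the problem is the file (or the working directory it is resolved against), not the read call.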

gnibbler
A: 
  • As best as I understand the pdf format, a pdf file shouldn't be a binary file. It should be a text file that may contain lots of binary blobs. I could be wrong.
  • On Windows, if you are opening a binary file, you need to include b in the mode of your file, i.e. open(filename, "rb").
    • On Unix-like systems, the b doesn't hurt anything, though it does not mean anything.
  • Always use a context manager with your files. That is to say, instead of writing f = open("test/test.pdf", "rb"), write with open("test/test.pdf", "rb") as f:. This ensures your file always gets closed.
  • list(f.read()) is rarely useful code. f.read() returns a str, and calling list on it makes a list of the characters (one-byte strings). This is very seldom needed.
  • Binary or text or whatever, read should work. Are you positive that there is anything in test/test.pdf? Python does not seem to think there is.
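The points above can be sketched together: a with block in "rb" mode, and a bytearray in place of list(f.read()). The temporary file is a hypothetical stand-in for test/test.pdf; a bytearray is just one reasonable choice when you want indexable byte values, not the only one.

```python
import os
import tempfile

# Hypothetical sample file standing in for test/test.pdf.
fd, path = tempfile.mkstemp(suffix=".pdf")
os.write(fd, b"%PDF-1.5\r")
os.close(fd)

# The with block closes the file even if read() raises.
with open(path, "rb") as f:
    data = bytearray(f.read())

closed_after_block = f.closed

print(closed_after_block)
print(len(data))
print(data[0])  # integer value of "%"

os.remove(path)
```

Indexing a bytearray yields integers, so data[0] can be compared with ord("%") directly, without building a list of one-character strings.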
Mike Graham
A: 

What platform are you running on?

Using python 2.6 on Windows XP, I get:

f = open("14500lf.pdf", "r")
data = list(f.read())
print data
['%', 'P', 'D', 'F', '-', '1', '.', '5', '\r', '%', '\xe2', '\xe3', '\xcf', '\xd3', '\n', '1', ' ', '0', ' ', 'o', 'b', 'j', '<', '<', '/', 'C', 'o', 'n', 't', 'e', 'n', 't', 's', ' ', '3', ' ', '0', ' ', 'R', '/', 'T', 'y', 'p', 'e', '/', 'P', 'a', 'g', 'e', '/', 'P', 'a', 'r', 'e', 'n', 't', ' ', '8', '7', ' ', '0', ' ', 'R', '/', 'T', 'h', 'u', 'm', 'b', ' ', '7', '1', ' ', '0', ' ', 'R', '/', 'R', 'o', 't', 'a', 't', 'e', ' ', '0', '/', 'M', 'e', 'd', 'i', 'a', 'B', 'o', 'x', '[', '0', ' ', '0', ' ', '6', '1', '2', ' ', '7', '9', '2', ']', '/', 'C', 'r', 'o', 'p', 'B', 'o', 'x', '[', '0', ' ', '0', ' ', '6', '1', '2', ' ', '7', '9', '2', ']', '/', 'R', 'e', 's', 'o', 'u', 'r', 'c', 'e', 's', ' ', '2', ' ', '0', ' ', 'R', '>', '>', '\r', 'e', 'n', 'd', 'o', 'b', 'j', '\r', '2', ' ', '0', ' ', 'o', 'b', 'j', '<', '<', '/', 'C', 'o', 'l', 'o', 'r', 'S', 'p', 'a', 'c', 'e', '<', '<', '/', 'D', 'e', 'f', 'a', 'u', 'l', 't', 'R', 'G', 'B', ' ', '1', '0', '0', ' ', '0', ' ', 'R', '>', '>', '/', 'F', 'o', 'n', 't', '<', '<', '/', 'F', '5', ' ', '9', '6', ' ', '0', ' ', 'R', '/', 'F', '7', ' ', '9', '7', ' ', '0', ' ', 'R', '/', 'F', '9', ' ', '1', '0', '6', ' ', '0', ' ', 'R', '/', 'F', '1', '1', ' ', '1', '0', '7', ' ', '0', ' ', 'R', '/', 'F', '1', '4', ' ', '1', '1', '1', ' ', '0', ' ', 'R', '/', 'F', '1', '6', ' ', '1', '1', '6', ' ', '0', ' ', 'R', '/', 'F', '1', '7', ' ', '1', '1', '7', ' ', '0', ' ', 'R', '/', 'F', '1', '3', ' ', '1', '1', '2', ' ', '0', ' ', 'R', '>', '>', '/', 'P', 'r', 'o', 'c', 'S', 'e', 't', '[', '/', 'P', 'D', 'F', '/', 'T', 'e', 'x', 't', ']', '>', '>', '\r', 'e', 'n', 'd', 'o', 'b', 'j', '\r', '3', ' ', '0', ' ', 'o', 'b', 'j', '<', '<', '/', 'L', 'e', 'n', 'g', 't', 'h', ' ', '4', ' ', '0', ' ', 'R', '/', 'F', 'i', 'l', 't', 'e', 'r', '/', 'F', 'l', 'a', 't', 'e', 'D', 'e', 'c', 'o', 'd', 'e', '>', '>', 's', 't', 'r', 'e', 'a', 'm', '\n', 'H', '\x89', '\xa4', 'W', '\xd9', 'r', 'T', '\xc9', '\x11', '\xfd', '\x82', '\xfb', '\x0f', '\xf5', 
'\xd8', '\n', '\x8f', '\x8a', '\xda', '\x97', 'G', '!', '\x04', '\x06', '\x03']

This is on a PDF I happen to have on my desktop (it's an IC datasheet for the LTC1450).

Using "rb" (Read Binary):

f = open("14500lf.pdf", "rb")
data = list(f.read())
print data
['%', 'P', 'D', 'F', '-', '1', '.', '5', '\r', '%', '\xe2', '\xe3', '\xcf', '\xd3', '\r', '\n', '1', ' ', '0', ' ', 'o', 'b', 'j', '<', '<', '/', 'C', 'o', 'n', 't', 'e', 'n', 't', 's', ' ', '3', ' ', '0', ' ', 'R', '/', 'T', 'y', 'p', 'e', '/', 'P', 'a', 'g', 'e', '/', 'P', 'a', 'r', 'e', 'n', 't', ' ', '8', '7', ' ', '0', ' ', 'R', '/', 'T', 'h', 'u', 'm', 'b', ' ', '7', '1', ' ', '0', ' ', 'R', '/', 'R', 'o', 't', 'a', 't', 'e', ' ', '0', '/', 'M', 'e', 'd', 'i', 'a', 'B', 'o', 'x', '[', '0', ' ', '0', ' ', '6', '1', '2', ' ', '7', '9', '2', ']', '/', 'C', 'r', 'o', 'p', 'B', 'o', 'x', '[', '0', ' ', '0', ' ', '6', '1', '2', ' ', '7', '9', '2', ']', '/', 'R', 'e', 's', 'o', 'u', 'r', 'c', 'e', 's', ' ', '2', ' ', '0', ' ', 'R', '>', '>', '\r', 'e',

....Snip a few thousand lines...

'9', '1', ' ', '0', ' ', 'R', '/', 'I', 'D', '[', '<', 'd', 'd', '3', 'd', '2', '8', '5', 'e', '1', 'd', '9', '0', '4', '6', 'e', '1', 'f', '6', 'e', '7', '0', '8', 'b', 'd', '8', 'e', '4', 'f', '9', 'b', '1', '3', '>', '<', '4', '3', '8', 'a', '7', '7', '2', '3', 'f', 'b', '2', '9', 'e', '7', '4', '6', 'a', '4', 'd', '4', '1', '6', 'a', 'f', '7', '6', '2', 'd', '8', '0', '9', '5', '>', ']', '>', '>', '\r', '\n', 's', 't', 'a', 'r', 't', 'x', 'r', 'e', 'f', '\r', '\n', '2', '9', '0', '2', '6', '9', '\r', '\n', '%', '%', 'E', 'O', 'F', '\r', '\n']
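Comparing the two listings, the first difference is after '\xd3': binary mode shows '\r', '\n' where text mode shows only '\n'. The sketch below only emulates that translation with replace on the leading bytes copied from the output above; it does not reproduce what the Windows runtime does internally, and the variable names are made up.

```python
# Leading bytes as shown in the "rb" listing above.
binary_head = b"%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n1 0 obj"

# Emulate text-mode translation: each "\r\n" pair collapses to "\n".
text_head = binary_head.replace(b"\r\n", b"\n")

# Index of the first byte where the two versions disagree.
pairs = zip(bytearray(binary_head), bytearray(text_head))
first_diff = next(i for i, (a, b) in enumerate(pairs) if a != b)

print(first_diff)
print(len(binary_head) - len(text_head))
```

Every byte after that index is shifted by one in the text-mode data, so any byte offsets stored inside the file no longer line up with what was read.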

Fake Name