Hi, I can't read a file, and I don't understand why:
f = open("test/test.pdf", "r")
data = list(f.read())
print data
Returns: []
I would like to open a PDF, extract every byte, and put them in a list.
What's wrong with my code? :(
Thanks,
f = open("test/test.pdf", "rb")
You must include the pseudo-mode "b" for binary when reading and writing on Windows. Otherwise the OS silently translates what it considers to be "line endings", corrupting binary data on the way in or out.
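To see the translation for yourself, here is a minimal sketch. It is written in Python 3 syntax (the thread itself uses Python 2), and Python 3's text mode rewrites line endings on every platform, not only on Windows, so the effect is visible anywhere. The sample bytes and the temporary file are assumptions made up for the demonstration; latin-1 is chosen only because it can decode every byte value.

```python
import os
import tempfile

# Hypothetical sample: a few bytes shaped like the start of a PDF.
sample = b"%PDF-1.5\r%\xe2\xe3\xcf\xd3\r\n"

# Write the sample in binary mode so nothing is translated on the way out.
fd, path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as f:
    f.write(sample)

# Binary mode returns the bytes exactly as they are stored on disk.
with open(path, "rb") as f:
    raw = f.read()

# Text mode decodes the bytes and rewrites "\r" and "\r\n" as "\n".
with open(path, "r", encoding="latin-1") as f:
    text = f.read()

os.remove(path)

print(len(raw), len(text))  # the text-mode result is shorter than the file
print(repr(raw))
print(repr(text))
```

The binary read matches the file byte for byte. The text read has lost the carriage returns, which is exactly the kind of silent change that breaks binary formats.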
Jonathan is correct that you should be opening the file in binary mode if you are on Windows.
However, a PDF file will start with "%PDF-", which would at least be read in regardless of whether you are using binary mode or not.
So it appears to me that your "test/test.pdf" is an empty file.
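A quick way to test the empty-file theory, as a rough sketch in Python 3 syntax. The helper names looks_like_pdf and _make_temp are hypothetical, and the two throwaway files stand in for your real "test/test.pdf".

```python
import os
import tempfile

def looks_like_pdf(path):
    """Return (size, has_header): the file size in bytes, and whether
    the file starts with the b"%PDF-" marker."""
    size = os.path.getsize(path)
    with open(path, "rb") as f:
        header = f.read(5)
    return size, header == b"%PDF-"

def _make_temp(contents):
    """Write contents to a temporary file and return its path (demo helper)."""
    fd, path = tempfile.mkstemp(suffix=".pdf")
    with os.fdopen(fd, "wb") as f:
        f.write(contents)
    return path

# Demo: one empty file, one file that begins like a PDF.
empty_path = _make_temp(b"")
pdf_path = _make_temp(b"%PDF-1.5\r\n")

empty_result = looks_like_pdf(empty_path)
pdf_result = looks_like_pdf(pdf_path)

os.remove(empty_path)
os.remove(pdf_path)

print(empty_result)
print(pdf_result)
```

If the size comes back as 0 for your own file, the problem is the file (or the path you are running from), not the read call.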
On Windows, include b in the mode of your file, i.e. open(filename, "rb"). On other platforms, b doesn't hurt anything, though it does not mean anything.
Use a context manager for your files: instead of f = open("test/test.pdf", "rb"), say with open("test/test.pdf", "rb") as f:. This ensures your file always gets closed.
list(f.read()) is not likely to be useful code very often. f.read() returns a str, and calling list on it makes a list of the characters (one-byte strings). This is very seldom needed.
read should work. Are you positive that there is anything in test/test.pdf? Python does not seem to think there is.
What platform are you running on?
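Putting the points above together, a minimal sketch might look like this. It uses Python 3 syntax, where list() on the result of a binary read gives integers from 0 to 255 rather than the one-byte strings Python 2 produces. The function name read_byte_values and the temporary demo file are assumptions for illustration.

```python
import os
import tempfile

def read_byte_values(path):
    """Read the whole file in binary mode and return its bytes as a list of ints."""
    with open(path, "rb") as f:  # the file is closed when the block exits
        return list(f.read())

# Demo on a throwaway file holding just the PDF marker.
fd, demo_path = tempfile.mkstemp(suffix=".pdf")
with os.fdopen(fd, "wb") as f:
    f.write(b"%PDF-")

values = read_byte_values(demo_path)
os.remove(demo_path)

print(values)  # [37, 80, 68, 70, 45]
```

If you only need to index or slice the data, the object returned by f.read() already supports that, so the list() call can often be dropped.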
Using python 2.6 on Windows XP, I get:
f = open("14500lf.pdf", "r")
data = list(f.read())
print data
['%', 'P', 'D', 'F', '-', '1', '.', '5', '\r', '%', '\xe2', '\xe3', '\xcf', '\xd3', '\n', '1', ' ', '0', ' ', 'o', 'b', 'j', '<', '<', '/', 'C', 'o', 'n', 't', 'e', 'n', 't', 's', ' ', '3', ' ', '0', ' ', 'R', '/', 'T', 'y', 'p', 'e', '/', 'P', 'a', 'g', 'e', '/', 'P', 'a', 'r', 'e', 'n', 't', ' ', '8', '7', ' ', '0', ' ', 'R', '/', 'T', 'h', 'u', 'm', 'b', ' ', '7', '1', ' ', '0', ' ', 'R', '/', 'R', 'o', 't', 'a', 't', 'e', ' ', '0', '/', 'M', 'e', 'd', 'i', 'a', 'B', 'o', 'x', '[', '0', ' ', '0', ' ', '6', '1', '2', ' ', '7', '9', '2', ']', '/', 'C', 'r', 'o', 'p', 'B', 'o', 'x', '[', '0', ' ', '0', ' ', '6', '1', '2', ' ', '7', '9', '2', ']', '/', 'R', 'e', 's', 'o', 'u', 'r', 'c', 'e', 's', ' ', '2', ' ', '0', ' ', 'R', '>', '>', '\r', 'e', 'n', 'd', 'o', 'b', 'j', '\r', '2', ' ', '0', ' ', 'o', 'b', 'j', '<', '<', '/', 'C', 'o', 'l', 'o', 'r', 'S', 'p', 'a', 'c', 'e', '<', '<', '/', 'D', 'e', 'f', 'a', 'u', 'l', 't', 'R', 'G', 'B', ' ', '1', '0', '0', ' ', '0', ' ', 'R', '>', '>', '/', 'F', 'o', 'n', 't', '<', '<', '/', 'F', '5', ' ', '9', '6', ' ', '0', ' ', 'R', '/', 'F', '7', ' ', '9', '7', ' ', '0', ' ', 'R', '/', 'F', '9', ' ', '1', '0', '6', ' ', '0', ' ', 'R', '/', 'F', '1', '1', ' ', '1', '0', '7', ' ', '0', ' ', 'R', '/', 'F', '1', '4', ' ', '1', '1', '1', ' ', '0', ' ', 'R', '/', 'F', '1', '6', ' ', '1', '1', '6', ' ', '0', ' ', 'R', '/', 'F', '1', '7', ' ', '1', '1', '7', ' ', '0', ' ', 'R', '/', 'F', '1', '3', ' ', '1', '1', '2', ' ', '0', ' ', 'R', '>', '>', '/', 'P', 'r', 'o', 'c', 'S', 'e', 't', '[', '/', 'P', 'D', 'F', '/', 'T', 'e', 'x', 't', ']', '>', '>', '\r', 'e', 'n', 'd', 'o', 'b', 'j', '\r', '3', ' ', '0', ' ', 'o', 'b', 'j', '<', '<', '/', 'L', 'e', 'n', 'g', 't', 'h', ' ', '4', ' ', '0', ' ', 'R', '/', 'F', 'i', 'l', 't', 'e', 'r', '/', 'F', 'l', 'a', 't', 'e', 'D', 'e', 'c', 'o', 'd', 'e', '>', '>', 's', 't', 'r', 'e', 'a', 'm', '\n', 'H', '\x89', '\xa4', 'W', '\xd9', 'r', 'T', '\xc9', '\x11', '\xfd', '\x82', '\xfb', '\x0f', '\xf5', 
'\xd8', '\n', '\x8f', '\x8a', '\xda', '\x97', 'G', '!', '\x04', '\x06', '\x03']
That is with a PDF I happen to have on my desktop (it's an IC datasheet, LTC1450).
Using "rb" (Read Binary):
f = open("14500lf.pdf", "rb")
data = list(f.read())
print data
['%', 'P', 'D', 'F', '-', '1', '.', '5', '\r', '%', '\xe2', '\xe3', '\xcf', '\xd3', '\r', '\n', '1', ' ', '0', ' ', 'o', 'b', 'j', '<', '<', '/', 'C', 'o', 'n', 't', 'e', 'n', 't', 's', ' ', '3', ' ', '0', ' ', 'R', '/', 'T', 'y', 'p', 'e', '/', 'P', 'a', 'g', 'e', '/', 'P', 'a', 'r', 'e', 'n', 't', ' ', '8', '7', ' ', '0', ' ', 'R', '/', 'T', 'h', 'u', 'm', 'b', ' ', '7', '1', ' ', '0', ' ', 'R', '/', 'R', 'o', 't', 'a', 't', 'e', ' ', '0', '/', 'M', 'e', 'd', 'i', 'a', 'B', 'o', 'x', '[', '0', ' ', '0', ' ', '6', '1', '2', ' ', '7', '9', '2', ']', '/', 'C', 'r', 'o', 'p', 'B', 'o', 'x', '[', '0', ' ', '0', ' ', '6', '1', '2', ' ', '7', '9', '2', ']', '/', 'R', 'e', 's', 'o', 'u', 'r', 'c', 'e', 's', ' ', '2', ' ', '0', ' ', 'R', '>', '>', '\r', 'e',....Snip a few thousand lines...
'9', '1', ' ', '0', ' ', 'R', '/', 'I', 'D', '[', '<', 'd', 'd', '3', 'd', '2', '8', '5', 'e', '1', 'd', '9', '0', '4', '6', 'e', '1', 'f', '6', 'e', '7', '0', '8', 'b', 'd', '8', 'e', '4', 'f', '9', 'b', '1', '3', '>', '<', '4', '3', '8', 'a', '7', '7', '2', '3', 'f', 'b', '2', '9', 'e', '7', '4', '6', 'a', '4', 'd', '4', '1', '6', 'a', 'f', '7', '6', '2', 'd', '8', '0', '9', '5', '>', ']', '>', '>', '\r', '\n', 's', 't', 'a', 'r', 't', 'x', 'r', 'e', 'f', '\r', '\n', '2', '9', '0', '2', '6', '9', '\r', '\n', '%', '%', 'E', 'O', 'F', '\r', '\n']