ansaurus

Question

Answer 1

+2 A:

Have you tried looking inside the downloaded file with for example a text editor?

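You'll see that it contains an HTML page, not a PDF. Probably the URL does not point to the PDF, or there is some redirecting going on that HttpURLConnection does not follow: it follows ordinary HTTP 3xx redirects by default, but not redirects that switch between HTTP and HTTPS, and it never follows HTML meta-refresh or JavaScript redirects.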

Make sure the URL correctly points to the PDF. You could use Apache HttpClient for doing more sophisticated things with HTTP, including automatically handling HTTP redirects.

Note: The code you posted does not compile, because a } is misplaced.

Jesper 2009-09-04 09:51:09

That code *does* point to the PDF, I believe. He appends filename to URL.

Brian Agnew 2009-09-04 09:56:13

Now it compiles.

Sergio del Amo 2009-09-04 09:58:17

I opened the PDF with an editor and there is an HTML file inside.

Sergio del Amo 2009-09-04 10:00:48

Answer 2

+1 A:

Inspect the resulting file; I expect it is an HTML file. The site probably returns an error page if there is no referrer, or uses a JavaScript redirect page or something similar. You can use the HttpURLConnection class to check the HTTP headers returned by the server.

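import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

// Send a HEAD request so that only the response headers are transferred.
URL url = new URL(
    "http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_001.pdf");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("HEAD");
try {
  // getHeaderFields() connects implicitly; the status line is listed under a null key.
  for (Map.Entry<String, List<String>> header : conn.getHeaderFields()
      .entrySet()) {
    System.out.println(header.getKey() + "=" + header.getValue());
  }
} finally {
  conn.disconnect();
}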

The above code returns a Content-Type of text/html.
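To turn that observation into a check a download routine could run before saving anything, here is a minimal sketch. The class and method names (ContentTypeCheck, isPdf) are made up for illustration and are not part of any library; the value passed in is assumed to come from conn.getContentType() or the Content-Type entry printed above.

```java
import java.util.Locale;

final class ContentTypeCheck {
    /**
     * Returns true when a Content-Type header value names a PDF.
     * Parameters such as "; charset=UTF-8" and letter case are ignored.
     */
    static boolean isPdf(String contentType) {
        if (contentType == null) {
            return false; // the server sent no Content-Type at all
        }
        int semicolon = contentType.indexOf(';');
        String mediaType = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("application/pdf");
    }
}
```

With the response above, ContentTypeCheck.isPdf("text/html") is false, so the caller would know not to write the body to a .pdf file.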

McDowell 2009-09-04 10:02:58

You are right. I opened it with an editor and there is HTML inside.

Sergio del Amo 2009-09-04 10:06:12

Answer 3

+1 A:

For this kind of exploration, I highly recommend Jython (or Groovy, or ...). For example:

C:\Users\Vinay>jython
Jython 2.5.0 (Release_2_5_0:6476, Jun 16 2009, 13:33:26)
[Java HotSpot(TM) Client VM (Sun Microsystems Inc.)] on java1.6.0_16
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_001.pdf"
>>> import java.net
>>> import jarray
>>> u = java.net.URL(s)
>>> os = u.openStream()
>>> buffer = jarray.zeros(1024, 'b')
>>> n = os.read(buffer, 0, 1024)
>>> java.lang.String(buffer)

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.1//EN"
    "http://www.w3.org/TR/xhtml-basic/xhtml-basic11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>

<meta name="viewport" content="width=240, user-scalable=yes" />
<META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
<META HTTP-EQUIV="Expires" CONTENT="-1">
<meta http-equiv="Cache-control" content="no-cache">
<meta http-equiv="Cache-control" content="must-revalidate">
<meta http-equiv="Cache-control" content="max-age=0">
<meta http-equiv="refresh" content="200">
<title>NBC.com: Heroes</title>
<link rel="stylesheet" type="text/css"  href="/style/default.css?sid=c67ddc30f79
ec4cc811f6e67e383fed7" />
<link rel="stylesheet" type="text/css"  href="/style/hro.css?sid=c67ddc30f79ec4c
c811f6e67e383fed7" />

</head>
<body>
<center><img src="http://oimg.nbcuni.com/b/ss/nbcunbcnetworkwapbu,nbcuwapsitebu/
5/H.8--WAP/4aa0e7ce2535c?vid=c67ddc30f79ec4cc811f6e67e383fed7&gn=NBC.com Front
>>>

which confirms what you found, but without edit/compile cycles to get in the way. Just my 2 cents...

As for how to get the data - it may be that you have to spoof your User-Agent header. From Firefox, the same URL returns a Content-Type of application/pdf, and the PDF file.

Update: The following Jython script:

import java.net
import jarray

s = "http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_001.pdf"
u = java.net.URL(s)
c = u.openConnection()
c.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.2) Gecko/20090810 Ubuntu/9.10 (karmic) Firefox/3.5.2")
BUFLEN = 4
buffer = jarray.zeros(BUFLEN, 'b')
c.connect()
stream = c.getInputStream()
stream.read(buffer, 0, BUFLEN)
data = java.lang.String(buffer)
print data

prints

%PDF

so the site is looking at the User-Agent header.
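The same four-byte test can be done from plain Java once the first bytes of the response (or of the saved file) are in hand. This is only a sketch: PdfSignature and looksLikePdf are hypothetical names, and it checks nothing beyond the leading %PDF marker seen in the output above.

```java
import java.nio.charset.StandardCharsets;

final class PdfSignature {
    // The marker the Jython script printed for a genuine PDF response.
    private static final byte[] MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** Returns true when the given leading bytes start with "%PDF". */
    static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false; // too short to contain the marker
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }
}
```

An HTML page such as the one returned without the header starts with "<?xm", so it fails this check immediately.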

Vinay Sajip 2009-09-04 10:16:10

How can I spoof the User-Agent Header?

Sergio del Amo 2009-09-04 10:29:26

If you are sticking with Java's `HttpURLConnection`, set it as a request property prior to connecting. _(Note that spoofing the user agent might work in this case, but it is only one of a number of tricks web servers can use to differentiate between real browsers and bots/spiders/etc.)_
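A rough sketch of what that could look like with only the standard library follows. UserAgentDownload, prepare and download are illustrative names rather than an existing API, and whether a given site accepts the header is an assumption that has to be tested against that site.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class UserAgentDownload {
    /** Creates a connection (not yet opened) that will send the given User-Agent. */
    static HttpURLConnection prepare(String address, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(address).toURL().openConnection();
            // Request properties must be set before the connection is opened.
            conn.setRequestProperty("User-Agent", userAgent);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Streams the response body to target, replacing any existing file. */
    static void download(String address, String userAgent, Path target)
            throws IOException {
        HttpURLConnection conn = prepare(address, userAgent);
        try (InputStream in = conn.getInputStream()) { // connects implicitly
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        } finally {
            conn.disconnect();
        }
    }
}
```

A call such as UserAgentDownload.download(pdfUrl, browserUserAgent, Paths.get("Heroes_novel_001.pdf")) would then save the response, where browserUserAgent is a browser string like the Firefox ones quoted in this thread.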

McDowell 2009-09-04 10:46:07

Answer 4

+1 A:

This is the same issue as in your other question. NBC.com doesn't send the PDF back if it thinks you are a scraper :)

The same trick will work:

conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13");

ZZ Coder 2009-09-04 10:49:40


Downloaded PDF with Java is corrupt?
