I have read the excellent discussion about How to download and save a file from internet using Java. However, when I execute the following code, I get a corrupt PDF. Any idea why?

import java.io.*;
import java.net.*;

public class PDFDownload {
    public static String URL = "http://www.nbc.com/Heroes/novels/downloads/";
    public static String FOLDER = "C:/Users/sdelamo/workspace/SandBox/HeroesNovel/";

    public static void main(String[] args) {
        String filename = "Heroes_novel_001.pdf";
        try {
            saveUrl(FOLDER + filename, URL + filename);
        } catch (MalformedURLException e) {
            System.out.println("MalformedURLException");
        } catch (IOException e) {
            System.out.println("IOException");
        }
    }

    public static void saveUrl(String filename, String urlString) throws MalformedURLException, IOException {
        BufferedInputStream in = null;
        FileOutputStream fout = null;
        try {
            URL url = new URL(urlString);
            in = new BufferedInputStream(url.openStream());
            fout = new FileOutputStream(filename);

            byte[] data = new byte[1024];
            int count;
            while ((count = in.read(data, 0, 1024)) != -1) {
                fout.write(data, 0, count);
            }
        } finally {
            if (in != null)
                in.close();
            if (fout != null)
                fout.close();
        }
    }
}

The above code downloads HTML instead of a PDF. This is the output:

<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.1//EN"
    "http://www.w3.org/TR/xhtml-basic/xhtml-basic11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>

<meta name="viewport" content="width=240, user-scalable=yes" />
<HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
<META HTTP-EQUIV="Expires" CONTENT="-1">
<meta http-equiv="Cache-control" content="no-cache">
<meta http-equiv="Cache-control" content="must-revalidate">
<meta http-equiv="Cache-control" content="max-age=0">
<meta http-equiv="refresh" content="200">

<title>NBC.com: Heroes</title>
<link rel="stylesheet" type="text/css"  href="/style/default.css?sid=8a9212f822e1c675330ec418bc531169" />
<link rel="stylesheet" type="text/css"  href="/style/hro.css?sid=8a9212f822e1c675330ec418bc531169" /> 

</head>
<body>
<center><img src="http://oimg.nbcuni.com/b/ss/nbcunbcnetworkwapbu,nbcuwapsitebu/5/H.8--WAP/4aa0e4cb8b448?vid=8a9212f822e1c675330ec418bc531169&gn=NBC.com Front Door&c2=&c3=Miscellaneous&c4=&c6=m.nbc.com/show/hro&c8=TV Entertainment&c9=NBC Network&c10=&c11= | &c12= | &c25=offdeck&c27=internal&c29=&c44=D=User-Agent&r=" width="5" height="5" border="0" /></center>
<h1 id="fHeader">
<a  href="/?sid=8a9212f822e1c675330ec418bc531169">
<img src="/images/nbc_logo.gif" alt="NBC : logo" border="0" />
</a>
</h1>

<h2>
<a  href="/show/hro?sid=8a9212f822e1c675330ec418bc531169">
<img src="/images/shows/1221684699_Heroes_WAP_166x54.jpg" alt="Heroes : showheader" border="0" />
</a>
</h2>
<div id="tunein_nexton">
    <span id="tunein">Mondays 9/8c</span>
</div><!--end #tunein_nexton-->
<div id="tunein_nexton">
    <!--<span id="tunein">Mondays 8/7c</span>-->

    <p id="nexton"><span class="sectiontitle"></span></p>
</div><!--end #tunein_nexton-->
<div id="featuredcontent">
    <h3>FEATURED CONTENT</h3>
    <table id="featuredItemsTable">

     <tr>
      <td><a  href="/show/hro/videos.html?sid=8a9212f822e1c675330ec418bc531169"><img src="/images/hro/nbc_hro_pro_040X921HRO120FLYPSIDE_exp921_20090_543_large.jpg" alt="featured" /></a>
      </td>
      <td>
       <span class="ftitle">Dreams</span>
       <span class="fdesc">Heroes premieres Mon., Sept. 21s...</span>
      </td>
     </tr>
             <tr>
      <td><a  href="/show/hro/recaps.html?sid=8a9212f822e1c675330ec418bc531169"><img src="http://origin-www.nbc.com/Heroes/images/episodes/season3/325/hro_325_01.jpg" alt="featured" height="45" width="80"/></a>
      </td>
      <td>
       <span class="ftitle">Recap:</span>
       <span class="fdesc">Season 3 Episode An Invisible Thread</span>
      </td>
     </tr>
             <tr>
      <td><a  href="/show/hro/photos.html?sid=8a9212f822e1c675330ec418bc531169"><img src="http://origin-www.nbc.com/app2/img/200x200xS/scet/photos/51/3736/NUP_110031_0323.JPG" alt="featured" height="45" width="80"/></a>
      </td>
      <td class="finfo">
       <span class="ftitle">Photo:</span>
       <span class="fdesc">Heroes "Cast Photos"</span>
      </td>
     </tr>
        </table>


</div><!--end #featuredcontent-->

<h3>HEROES</h3>
<table class="showNav">
    <tr><td><a  href="/show/hro/about.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="1">About</a></td></tr>
     <tr><td><a  href="/show/hro/videos.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="2">Videos</a></td></tr>
       <tr><td><a  href="/show/hro/recaps.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="3">Episode Recaps</a></td></tr>
        <tr><td><a  href="/show/hro/photos.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="4">Photos</a></td></tr>
       <tr><td><a  href="/show/hro/community.html?sid=8a9212f822e1c675330ec418bc531169" accesskey="5">Community</a></td></tr>
    <tr><td><a  href="/shows.shtml?sid=8a9212f822e1c675330ec418bc531169" accesskey="6">Shows List</a></td></tr>
</table>
<!-- <a  href="http://www.insightexpress.com/ix/Survey.aspx?id=151580&amp;accessCode=3161643404&amp;sid=8a9212f822e1c675330ec418bc531169" ><img src="/images/mNBCcom_166x54.jpg" border="0"></a> -->



<div class="footer" align="center"><a  href="http://m.nbc.com?sid=8a9212f822e1c675330ec418bc531169"><strong>NBC Mobile Main</strong></a> | <a  href="/terms.shtml?sid=8a9212f822e1c675330ec418bc531169"><strong>Terms of Use</strong></a> | <a  href="/privacy.shtml?sid=8a9212f822e1c675330ec418bc531169"><strong>Privacy</strong></a></div><div class="cpyrt" align="center">&#169; NBC Universal, Inc.</div>

</body>
</html>

Any idea how to download the PDF?

SOLUTION

Set User-Agent before connecting.

URL u = new URL(urlString); 
HttpURLConnection huc =  (HttpURLConnection)  u.openConnection();
huc.setRequestMethod("GET"); 
huc.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)");
huc.connect();    

in = new BufferedInputStream(huc.getInputStream());
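
Putting those pieces together, a minimal sketch of the whole download might look like the following. The class name UserAgentDownload and the helper copy are hypothetical, and the User-Agent value is only an assumption about what the server accepts; any browser-like string may work.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;

class UserAgentDownload {
    // Assumed browser-like value; not a value the server is known to require.
    static final String USER_AGENT =
        "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2";

    /** Copies everything from in to out and returns the number of bytes copied. */
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[1024];
        long total = 0;
        int count;
        while ((count = in.read(buf)) != -1) {
            out.write(buf, 0, count);
            total += count;
        }
        return total;
    }

    /** Downloads urlString to target, sending the User-Agent header with the request. */
    static long saveUrl(Path target, String urlString) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(urlString).openConnection();
        conn.setRequestMethod("GET");
        // Request properties must be set before the connection is opened.
        conn.setRequestProperty("User-Agent", USER_AGENT);
        try (InputStream in = conn.getInputStream();
             OutputStream out = Files.newOutputStream(target)) {
            return copy(in, out);
        } finally {
            conn.disconnect();
        }
    }
}
```

The try-with-resources block closes both streams even if the copy fails, which replaces the manual finally block from the question.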
+2  A: 

Have you tried looking inside the downloaded file with, for example, a text editor?

You'll see that it contains an HTML page, not a PDF. Probably the URL does not point to the PDF, or there is some redirecting going on that the standard java.net classes do not follow by default (HttpURLConnection follows redirects within the same protocol, but not between HTTP and HTTPS).

Make sure the URL correctly points to the PDF. You could use Apache HttpClient for doing more sophisticated things with HTTP, including automatically handling HTTP redirects.
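
If you want to rule redirects in or out without pulling in another library, one option is to turn off automatic redirect handling and print each hop yourself. This is only a sketch: the class RedirectCheck and its method names are hypothetical, and it assumes every hop stays on HTTP or HTTPS.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.MalformedURLException;
import java.net.URL;

class RedirectCheck {
    /** True for the HTTP status codes that normally carry a Location header. */
    static boolean isRedirect(int status) {
        return status == 301 || status == 302 || status == 303
            || status == 307 || status == 308;
    }

    /** Resolves a possibly relative Location value against the requested URL. */
    static URL resolveLocation(URL requested, String location) throws MalformedURLException {
        return new URL(requested, location);
    }

    /** Follows at most maxHops redirects by hand and returns the final URL. */
    static URL finalUrl(URL start, int maxHops) throws IOException {
        URL current = start;
        for (int hop = 0; hop < maxHops; hop++) {
            HttpURLConnection conn = (HttpURLConnection) current.openConnection();
            conn.setInstanceFollowRedirects(false); // inspect each hop ourselves
            conn.setRequestMethod("HEAD");
            try {
                int status = conn.getResponseCode();
                String location = conn.getHeaderField("Location");
                if (!isRedirect(status) || location == null) {
                    return current; // no further redirect
                }
                System.out.println(status + " " + current + " -> " + location);
                current = resolveLocation(current, location);
            } finally {
                conn.disconnect();
            }
        }
        throw new IOException("More than " + maxHops + " redirects from " + start);
    }
}
```

If finalUrl returns the URL you started with, the server answered directly and redirects are not the cause.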

Note: The code you posted does not compile, because a } is misplaced.

Jesper
That code *does* point to the PDF, I believe. He appends filename to URL.
Brian Agnew
It compiles now.
Sergio del Amo
I opened the PDF with an editor and there is an HTML file inside.
Sergio del Amo
+1  A: 

Inspect the resulting file; I expect it is an HTML file. The site probably returns an error if there is no referrer, or uses a JavaScript redirect page or something similar. You can use the HttpURLConnection class to check the HTTP headers returned by the server.

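import java.net.HttpURLConnection;
import java.net.URL;
import java.util.List;
import java.util.Map;

// Place the following inside a method that declares "throws IOException".
URL url = new URL(
    "http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_001.pdf");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();
conn.setRequestMethod("HEAD"); // ask for the headers only, not the body
try {
  for (Map.Entry<String, List<String>> header : conn.getHeaderFields()
      .entrySet()) {
    System.out.println(header.getKey() + "=" + header.getValue());
  }
} finally {
  conn.disconnect();
}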

The above code returns a Content-Type of text/html.
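
Building on that check, a download routine could refuse to save anything the server does not declare as a PDF, so a stray HTML page never ends up in a .pdf file. A minimal sketch, where ContentTypeGuard, isPdf, and openPdf are hypothetical names:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;

class ContentTypeGuard {
    /** True if a Content-Type value names application/pdf, ignoring parameters and case. */
    static boolean isPdf(String contentType) {
        if (contentType == null) {
            return false;
        }
        // Drop parameters such as "; charset=UTF-8" before comparing.
        String mediaType = contentType.split(";", 2)[0].trim();
        return mediaType.equalsIgnoreCase("application/pdf");
    }

    /** Returns the response body stream, or fails if the server did not declare a PDF. */
    static InputStream openPdf(HttpURLConnection conn) throws IOException {
        String type = conn.getContentType();
        if (!isPdf(type)) {
            throw new IOException("Expected application/pdf but got " + type);
        }
        return conn.getInputStream();
    }
}
```

With the URL from the question and no User-Agent set, openPdf would throw instead of letting the HTML through.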

McDowell
You are right. I opened it with an editor and there is HTML inside.
Sergio del Amo
+1  A: 

For this kind of exploration, I highly recommend Jython (or Groovy, or ...). For example:

C:\Users\Vinay>jython
Jython 2.5.0 (Release_2_5_0:6476, Jun 16 2009, 13:33:26)
[Java HotSpot(TM) Client VM (Sun Microsystems Inc.)] on java1.6.0_16
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_001.pdf"
>>> import java.net
>>> import jarray
>>> u = java.net.URL(s)
>>> os = u.openStream()
>>> buffer = jarray.zeros(1024, 'b')
>>> n = os.read(buffer, 0, 1024)
>>> java.lang.String(buffer)
<?xml version="1.0" encoding="UTF-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.1//EN"
    "http://www.w3.org/TR/xhtml-basic/xhtml-basic11.dtd">

<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>

<meta name="viewport" content="width=240, user-scalable=yes" />
<HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
<META HTTP-EQUIV="Expires" CONTENT="-1">
<meta http-equiv="Cache-control" content="no-cache">
<meta http-equiv="Cache-control" content="must-revalidate">
<meta http-equiv="Cache-control" content="max-age=0">
<meta http-equiv="refresh" content="200">
<title>NBC.com: Heroes</title>
<link rel="stylesheet" type="text/css"  href="/style/default.css?sid=c67ddc30f79
ec4cc811f6e67e383fed7" />
<link rel="stylesheet" type="text/css"  href="/style/hro.css?sid=c67ddc30f79ec4c
c811f6e67e383fed7" />

</head>
<body>
<center><img src="http://oimg.nbcuni.com/b/ss/nbcunbcnetworkwapbu,nbcuwapsitebu/
5/H.8--WAP/4aa0e7ce2535c?vid=c67ddc30f79ec4cc811f6e67e383fed7&gn=NBC.com Front
>>>

which confirms what you found, but without edit/compile cycles to get in the way. Just my 2 cents...

As for how to get the data - it may be that you have to spoof your User-Agent header. From Firefox, the same URL returns a Content-Type of application/pdf, and the PDF file.

Update: The following Jython script:

import java.net
import jarray

s = "http://www.nbc.com/Heroes/novels/downloads/Heroes_novel_001.pdf"
u = java.net.URL(s)
c = u.openConnection()
c.setRequestProperty("User-Agent", "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.2) Gecko/20090810 Ubuntu/9.10 (karmic) Firefox/3.5.2")
BUFLEN = 4
buffer = jarray.zeros(BUFLEN, 'b')
c.connect()
stream = c.getInputStream()
stream.read(buffer, 0, BUFLEN)
data = java.lang.String(buffer)
print data

prints

%PDF

so the site is looking at the User-Agent header.
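
The same four-byte check can be done in plain Java against the saved file, which is a cheap way to verify a download. This is a sketch only; PdfSignature and its method names are hypothetical, and the check looks at nothing beyond the leading %PDF marker.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class PdfSignature {
    // PDF files begin with the ASCII characters "%PDF".
    private static final byte[] MAGIC = "%PDF".getBytes(StandardCharsets.US_ASCII);

    /** True if the given bytes begin with the PDF marker. */
    static boolean startsWithPdfMagic(byte[] head) {
        if (head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    /** Reads the first bytes of file and reports whether they look like a PDF. */
    static boolean looksLikePdf(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[MAGIC.length];
            int read = 0;
            // read() may return fewer bytes than requested, so loop until full or EOF.
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n == -1) {
                    break;
                }
                read += n;
            }
            return read == head.length && startsWithPdfMagic(head);
        }
    }
}
```

An HTML page such as the one in the question starts with "<?xm", so looksLikePdf would report false for it.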

Vinay Sajip
How can I spoof the User-Agent Header?
Sergio del Amo
If you are sticking with Java's `HttpURLConnection`, set it as a request property prior to connecting. _(Note that spoofing the user agent might work in this case, but it is only one of a number of tricks web servers can use to differentiate between real browsers and bots/spiders/etc.)_
McDowell
+1  A: 

This is the same issue as in your other question. NBC.com doesn't send the PDF back to you if it thinks you are a scraper :)

The same trick will do:

conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.13) Gecko/2009073021 Firefox/3.0.13");
ZZ Coder