I have used iText to parse PDF files. It works well on local files, but I want to parse PDF files hosted on web servers, like this one:

"http://protege.stanford.edu/publications/ontology_development/ontology101.pdf"

How can I do this using iText or another library? Thanks.

A: 

You need to download the bytes of the PDF file. You can do this with:

URL url = new URL("http://.....");
HttpURLConnection conn = (HttpURLConnection) url.openConnection();

if (conn.getResponseCode() != HttpURLConnection.HTTP_OK) { ..error.. }
if ( ! conn.getContentType().equals("application/pdf")) { ..error.. }

InputStream byteStream = conn.getInputStream();
try {
  ... // give bytes from byteStream to iText
} finally { byteStream.close(); }
Adrian Smith
A: 

Use the URLConnection class:

URL reqURL = new URL("http://www.mysite.edu/mydoc.pdf");
URLConnection urlCon = reqURL.openConnection();

Then you can use the URLConnection's methods to retrieve the content. The easiest way:

InputStream is = urlCon.getInputStream();
byte[] b = new byte[1024]; // buffer size; any reasonable size works
int len;
while ((len = is.read(b)) != -1) {
    // store the len bytes just read in your preferred way
}
is.close();
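As a sketch of "store the content in your preferred way", one common option is to accumulate the bytes in a ByteArrayOutputStream. The helper below is an illustration using an in-memory stream as the source; a real `urlCon.getInputStream()` works the same way:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadAllBytes {
    // Drain an InputStream into a byte array using the read loop shown above.
    static byte[] readAll(InputStream is) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] b = new byte[1024];
        int len;
        while ((len = is.read(b)) != -1) {
            out.write(b, 0, len); // write exactly the bytes just read
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for a live connection's stream.
        byte[] data = "%PDF-1.4 ...".getBytes("ISO-8859-1");
        byte[] copy = readAll(new ByteArrayInputStream(data));
        System.out.println(copy.length);
    }
}
```

The resulting byte array can then be handed to whatever PDF library you use.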
Frozen Spider
A: 

Thanks.

The problem is this: when I run the code against an HTML page it works well, but when I run it against a PDF URL I get strange characters like this:

£$Ëa’-ÕUø4¸s?·uTd×sËÌ•‹Éæ $ÒÒke.ÆýLÞg,ðÿ Z5¼®bÔGìÄ;¾q}!Zu2Ù·.Ûsn=ö•å?ÛZt? 

PDFs usually contain binary data; that's normal. Images, fonts, compressed content streams, you name it.
Mark Storer
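Those "strange characters" are just raw PDF bytes printed as if they were text. One quick sanity check, sketched here against in-memory byte arrays (a real download stream works the same way), is to verify the "%PDF" magic header and otherwise treat the data as binary:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class PdfMagic {
    // Returns true if the stream starts with the "%PDF" magic bytes.
    // Note: on a network stream, read() may return fewer than 4 bytes
    // in one call; for simplicity this sketch assumes it does not.
    static boolean looksLikePdf(InputStream is) throws IOException {
        byte[] magic = new byte[4];
        int n = is.read(magic);
        return n == 4
                && magic[0] == '%' && magic[1] == 'P'
                && magic[2] == 'D' && magic[3] == 'F';
    }

    public static void main(String[] args) throws IOException {
        byte[] pdfBytes = "%PDF-1.4\n...".getBytes("ISO-8859-1");
        byte[] htmlBytes = "<html>".getBytes("ISO-8859-1");
        System.out.println(looksLikePdf(new ByteArrayInputStream(pdfBytes)));
        System.out.println(looksLikePdf(new ByteArrayInputStream(htmlBytes)));
    }
}
```

If the check passes, keep the content as bytes and pass it to a PDF library rather than converting it to a String.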
A: 

Nothing to it. You can pass a URL directly to PdfReader and let it handle the streaming for you:

URL url = new URL("http://protege.stanford.edu/publications/ontology_development/ontology101.pdf");
PdfReader reader = new PdfReader(url);

The JavaDoc is your friend.

Mark Storer