views: 42

answers: 1
I'm building a Java application that downloads an HTML page from a website and saves the file on my local system. I can access the web page's URL manually via a browser, but when my Java program requests the same URL, the server returns a 503 error. Here's the scenario:

sample URL = http://content.somesite.com/demo/somepage.asp

I can access the above URL in a browser, but the Java code below fails to download the page:

StringBuilder data = new StringBuilder();
BufferedReader br = null;
try {
    br = new BufferedReader(new InputStreamReader(sourceUrl.openStream()));
    String inputLine;
    while ((inputLine = br.readLine()) != null) {
        data.append(inputLine);
    }
} catch (Exception e) {
    e.printStackTrace();
} finally {
    // Guard against an NPE if openStream() threw before br was assigned,
    // and handle the IOException that close() itself may throw.
    if (br != null) {
        try {
            br.close();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

So, my questions are:

  1. Am I doing anything wrong here?

  2. Is there a way for the server to block requests from programs/bots and allow only requests coming from browsers?

+2  A: 

You may want to try setting the User-Agent and Referer HTTP headers to something like what a normal web browser would send.

You can pick a User-Agent string from this list: Seehowitruns: User-agent strings.

In addition, if the page you are requesting is an internal page, it may also depend on cookies that were set on a previous page.
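A minimal sketch of setting those headers with `HttpURLConnection` might look like the following. The `USER_AGENT` string and the `Referer` value are placeholders — substitute a real browser User-Agent from the list above and whatever referring page makes sense for your site:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class PageDownloader {

    // Placeholder User-Agent; replace with a current browser string.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36";

    // Create the connection and set browser-like headers before connecting.
    static HttpURLConnection openConnection(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestProperty("User-Agent", USER_AGENT);
        conn.setRequestProperty("Referer", "http://content.somesite.com/");
        return conn;
    }

    // Read the response body line by line, as in the question's code.
    static String download(URL url) throws IOException {
        StringBuilder data = new StringBuilder();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(openConnection(url).getInputStream()))) {
            String line;
            while ((line = br.readLine()) != null) {
                data.append(line).append('\n');
            }
        }
        return data.toString();
    }
}
```

You would then call `PageDownloader.download(sourceUrl)` in place of `sourceUrl.openStream()`; the headers are applied before the request is sent, since `openConnection()` alone does not perform any network I/O.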

Daniel Vassallo
In this case, however, they probably do not want a bot accessing their site. If your program is for more than just private use, you may need to check their terms of service.
Thilo