I have some code that gets the page content from a URL:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

public class GetPageFromURLAction extends Thread {

    public String stringPageContent;
    public String targerURL;

    // Downloads the page at targetURL and returns its body as a single string.
    public String getPageContent(String targetURL) throws IOException {
        StringBuilder sb = new StringBuilder();
        URL url = new URL(targetURL);
        URLConnection openConnection = url.openConnection();
        // try-with-resources closes the reader even if readLine() throws
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(openConnection.getInputStream()))) {
            String temp;
            while ((temp = in.readLine()) != null) {
                sb.append(temp).append("\n");
            }
        }
        // String nohtml = sb.toString().replaceAll("\\<.*?>", "");
        return sb.toString();
    }

    public String getStringPageContent() {
        return stringPageContent;
    }

    public void setStringPageContent(String stringPageContent) {
        this.stringPageContent = stringPageContent;
    }

    public String getTargerURL() {
        return targerURL;
    }

    public void setTargerURL(String targerURL) {
        this.targerURL = targerURL;
    }

    @Override
    public void run() {
        try {
            this.stringPageContent=this.getPageContent(targerURL);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

}

Sometimes I receive an HTTP 403 or 405 error and the result string is null. I have tried checking permission to connect to the URL with:

    URLConnection openConnection = urlString.openConnection();
    openConnection.getPermission();

but it usually returns null. Does that mean that I don't have permission to access the link?

I have tried stripping off the query portion of the URL with:

String nohtml = sb.toString().replaceAll("\\<.*?>","");

where sb is a StringBuilder, but it doesn't seem to strip off the whole query substring.

On an unrelated note, I'd like to use threads here because I must retrieve many URLs. How can I create a multi-threaded client to improve the speed?

+1  A: 

The relevant error definitions are:

403 Forbidden

The server understood the request, but is refusing to fulfill it. Authorization will not help and the request SHOULD NOT be repeated. If the request method was not HEAD and the server wishes to make public why the request has not been fulfilled, it SHOULD describe the reason for the refusal in the entity. If the server does not wish to make this information available to the client, the status code 404 (Not Found) can be used instead.

405 Method Not Allowed

The method specified in the Request-Line is not allowed for the resource identified by the Request-URI. The response MUST include an Allow header containing a list of valid methods for the requested resource.

So yes, a 403 means that you do not have permission, and stripping off the query probably won't help at all.

A 405 means that the server does not allow the request method you used (here, GET) for that resource, but it wouldn't surprise me if there are servers which really mean 403 when they return 405.

In both cases, you should probably consider the URL permanently inaccessible.
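
To tell these cases apart in code, one option is to read the status code before reading the body. The following is only a sketch: it assumes the URL is http or https (so the connection can be cast to HttpURLConnection), and the class name StatusAwareFetcher, the helper describeStatus, and the message strings are made up for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper class; not part of the code in the question.
class StatusAwareFetcher {

    // Maps a status code to a short message. The wording is illustrative only.
    static String describeStatus(int code) {
        switch (code) {
            case HttpURLConnection.HTTP_FORBIDDEN:   // 403
                return "don't have permission";
            case HttpURLConnection.HTTP_BAD_METHOD:  // 405
                return "method not allowed";
            default:
                return "HTTP error " + code;
        }
    }

    // Returns the page body on 200, otherwise a message describing the status.
    // Assumes targetURL uses http or https, so the cast below succeeds.
    static String fetch(String targetURL) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(targetURL).openConnection();
        try {
            int code = conn.getResponseCode(); // sends the request, reads the status line
            if (code != HttpURLConnection.HTTP_OK) {
                return describeStatus(code);
            }
            StringBuilder sb = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    sb.append(line).append("\n");
                }
            }
            return sb.toString();
        } finally {
            conn.disconnect();
        }
    }
}
```

Calling getInputStream() directly on a 403 or 405 response throws an IOException, which is why the question's run() method ends up with a null string. Checking getResponseCode() first avoids that and lets the caller decide what to report.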

msw
Thanks for your reply! But why does getPermission usually return null when I use it to check permission?
tiendv
getPermission relates to Java 2 security only. It is unrelated to the permissions required or checked by the remote server.
bkail
So how can I handle the error when it is returned? For example, returning a string such as "don't have permission" when I receive a 403 or 405 error, something like that.
tiendv
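
On the threading part of the original question, which the thread above does not address: a common alternative to extending Thread once per URL is a fixed-size thread pool from java.util.concurrent. The sketch below is only one possible shape. The class name ParallelFetcher, the method fetchAll, and the fetcher parameter are hypothetical, and the download function is passed in so that any fetch routine can be plugged in.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper class; not part of the code in the question.
class ParallelFetcher {

    // Applies `fetcher` to every URL on a pool of `threads` workers.
    // Results are returned in the same order as `urls`; a URL whose
    // fetcher threw an exception yields null instead of aborting the batch.
    static List<String> fetchAll(List<String> urls, int threads,
                                 Function<String, String> fetcher) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<String> results = new ArrayList<>();
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> fetcher.apply(url));
            }
            // invokeAll blocks until every task has finished and
            // returns the futures in the same order as the tasks.
            for (Future<String> f : pool.invokeAll(tasks)) {
                try {
                    results.add(f.get());
                } catch (ExecutionException e) {
                    results.add(null);
                }
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while fetching", e);
        } finally {
            pool.shutdown();
        }
        return results;
    }
}
```

To use it with the question's code, the fetcher would wrap getPageContent and convert its IOException into an unchecked exception or an error string. The pool size is a tuning choice; since the work is mostly waiting on the network, it can reasonably exceed the number of CPU cores.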