Hi there!

I'm trying to build a web service of sorts on Google App Engine.

The problem is that I need to get data from a website (HTML scraping).

The request looks like this:

URL url = new URL(p_url);
HttpURLConnection con = (HttpURLConnection) url.openConnection();
InputStreamReader in = new InputStreamReader(con.getInputStream());
BufferedReader reader = new BufferedReader(in);

String result = "";
String line = "";
while ((line = reader.readLine()) != null)
{
    result += line + "\n"; // collect each line of the page into the result
}
reader.close();
return result;

Now App Engine gives me the following exception at the third line (the call to con.getInputStream()):

com.google.appengine.api.urlfetch.ResponseTooLargeException

This is because the maximum response size limit is 1 MB, while the total HTML of the page is about 1.5 MB.

Now my question: I only need the first 20 lines of the HTML for scraping. Is there a way to fetch only part of the HTML so that the ResponseTooLargeException is not thrown?

Thanks in advance!

A: 

Solved the problem by using the low-level URLFetch API and setting the allowTruncate option to true:

http://code.google.com/intl/nl-NL/appengine/docs/java/javadoc/com/google/appengine/api/urlfetch/FetchOptions.html

Basically it works like this:

// allowTruncate() makes URLFetch truncate an oversized response instead of throwing ResponseTooLargeException
HTTPRequest request = new HTTPRequest(_url, HTTPMethod.POST, FetchOptions.Builder.allowTruncate());
URLFetchService service = URLFetchServiceFactory.getURLFetchService();
HTTPResponse response = service.fetch(request);
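
To get just the first 20 lines the question asks about, a minimal sketch along these lines should work, assuming the page is UTF-8 encoded (the variable names here are only illustrative):

// getContent() returns the raw (possibly truncated) bytes of the fetched page.
// Decode them and keep only the first 20 lines, which is all the scraper needs.
String body = new String(response.getContent(), "UTF-8");
String[] lines = body.split("\r?\n");
StringBuilder firstTwenty = new StringBuilder();
for (int i = 0; i < Math.min(20, lines.length); i++) {
    firstTwenty.append(lines[i]).append('\n');
}
String result = firstTwenty.toString();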
wh0emPah
As per the docs: "The URL Fetch service limits the size of the data for an outgoing request, and for an incoming response. When using the java.net API, data larger than the limit is silently truncated. The low-level URL Fetch API lets you specify whether truncation happens silently, or whether exceeding a limit throws an exception."
Romain Hippeau