I need to retrieve text from a remote web site that does not provide an RSS feed.

What I know is that the data I need is always on pages linked to from the main page (http://www.example.com/) by a link whose text contains "Invoices Report".

For example:

<a href="http://www.example.com/data/invoices/2010/10/invoices-report---tuesday-october-12.html">Invoices Report - Tuesday, October 12</a>

So, I need to find all of the links on the main page that match this pattern and then retrieve all of the text on those pages that sits inside the <div class="invoice-body"> element.

Are there Java tools that help with this, and is there anything specific to Google App Engine for Java that can be used to do it?

+4  A: 

Check out http://code.google.com/appengine/docs/java/urlfetch/overview.html

You can use the URL Fetch service to read www.example.com/index.html line by line, and use a regular expression to look for "Invoices Report".

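import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// On App Engine, java.net.URL connections are served by the URL Fetch service.
URL url = new URL("http://www.example.com/index.html");
// Assumes the page is served as UTF-8; adjust the charset if it is not.
BufferedReader reader = new BufferedReader(new InputStreamReader(url.openStream(), "UTF-8"));
String line;

// Hand each line to your own matching logic (placeholder method).
while ((line = reader.readLine()) != null) {
    checkLineForTextAndAddLinkOrWhatever(line);
}
reader.close();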

If a link can be split across multiple lines, matching line by line will miss it. In that case, read the whole page into a single string first and match against that.
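For the matching step itself, here is a minimal sketch of what the regular expression approach could look like once the whole page is in one string. The class and method names (InvoiceLinkFinder, findInvoiceLinks, extractInvoiceBody) are made up for illustration. It assumes href values are double-quoted and that the invoice-body div contains no nested div; regexes are fragile on real-world HTML, so treat this as a starting point only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not from any library.
class InvoiceLinkFinder {
    // Matches <a ... href="URL" ...>TEXT</a>. DOTALL lets TEXT span line breaks.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*href=\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches the first <div class="invoice-body">...</div>.
    // Assumption: no nested <div> inside it, since the match stops at the first </div>.
    private static final Pattern BODY = Pattern.compile(
            "<div\\s+class=\"invoice-body\"\\s*>(.*?)</div>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the href of every link whose text contains "Invoices Report".
    static List<String> findInvoiceLinks(String html) {
        List<String> urls = new ArrayList<String>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            // Collapse whitespace so a line break inside the link text does not hide a match.
            String text = m.group(2).replaceAll("\\s+", " ");
            if (text.contains("Invoices Report")) {
                urls.add(m.group(1));
            }
        }
        return urls;
    }

    // Returns the text inside the invoice-body div with tags stripped, or null if absent.
    static String extractInvoiceBody(String html) {
        Matcher m = BODY.matcher(html);
        if (!m.find()) {
            return null;
        }
        return m.group(1).replaceAll("<[^>]+>", "").trim();
    }
}
```

You would call findInvoiceLinks on the main page, fetch each returned URL the same way as above, and pass each page to extractInvoiceBody.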

Riley