I need to retrieve text from a remote web site that does not provide an RSS feed.
What I know is that the data I need is always on pages linked from the main page (http://www.example.com/) by a link whose text contains "Invoices Report".
For example:
<a href="http://www.example.com/data/invoices/2010/10/invoices-report---tuesday-october-12.html">Invoices Report - Tuesday, October 12</a>
So, I need to find all of the links on the main page that match this pattern, and then retrieve all of the text on each linked page that sits inside the <div class="invoice-body"> element.
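To make the two steps concrete, here is a rough sketch of what I have in mind, using only java.util.regex on HTML that has already been downloaded as a String. The class and method names (InvoiceScraper, findReportLinks, extractInvoiceBodies) are made up for illustration, and it assumes double-quoted attributes and no nested <div> inside the invoice body. Regular expressions are brittle for HTML, so a proper HTML parser would presumably be the better tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class InvoiceScraper {

    // Matches <a ... href="URL" ...>TEXT</a>. Group 1 is the URL, group 2 the link text.
    // Assumption: attribute values are double-quoted.
    private static final Pattern LINK = Pattern.compile(
            "<a\\s[^>]*?href=\"([^\"]*)\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Matches <div class="invoice-body">...</div> up to the first closing </div>.
    // Assumption: the invoice body contains no nested <div> elements.
    private static final Pattern INVOICE_BODY = Pattern.compile(
            "<div\\s+class=\"invoice-body\"\\s*>(.*?)</div>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Step 1: collect the href of every link whose text contains "Invoices Report".
    static List<String> findReportLinks(String html) {
        List<String> urls = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            if (m.group(2).contains("Invoices Report")) {
                urls.add(m.group(1));
            }
        }
        return urls;
    }

    // Step 2: return the plain text of each invoice-body div on a report page.
    static List<String> extractInvoiceBodies(String html) {
        List<String> bodies = new ArrayList<>();
        Matcher m = INVOICE_BODY.matcher(html);
        while (m.find()) {
            String text = m.group(1)
                    .replaceAll("<[^>]+>", " ")   // drop inner tags
                    .replaceAll("\\s+", " ")      // collapse whitespace
                    .trim();
            bodies.add(text);
        }
        return bodies;
    }
}
```

The idea would be to call findReportLinks on the main page's HTML, download each returned URL, and pass each downloaded page to extractInvoiceBodies. Downloading is left out of the sketch.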
Are there Java tools that help with this, and is there anything specific to Google App Engine for Java that can be used to do it?