I'm working on a timer EJB that fetches CSV data from a URL every 5 minutes and processes it into a database using OpenCSV. The relevant code is:

@Stateless
@TransactionManagement(TransactionManagementType.BEAN)
public class Importer {
    @Schedule(minute="*/5", hour="*", info="Importer")
    private void scheduledTimeout(final Timer t) {
        try {
            URL url = new URL("someurl");
            HttpURLConnection conn = (HttpURLConnection)url.openConnection();
            conn.setRequestMethod("GET");
            conn.setReadTimeout(15 * 1000);
            conn.setUseCaches(false);
            conn.connect();

            CSVReader reader = new CSVReader(new BufferedReader(new InputStreamReader(conn.getInputStream())));

            // parse data in loop, persist to database in batches
        } catch(Exception e) {
            // log error, roll back open transaction
        }
    }
}

Each CSV row contains a date indicating when the row was added, and the file is supposed to span the last hour and a half. I compare each row's added date with the most recent date in the existing data to decide whether the record is new and whether to insert it into the database.
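For clarity, the new-row check amounts to something like the sketch below. This is not my actual code: the class name `NewRowFilter` and the method `newerThan` are made up for illustration, and it assumes the added dates have already been parsed into `java.util.Date` values.

```java
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class NewRowFilter {
    /**
     * Returns only the dates strictly later than the newest date already
     * stored in the database. Rows at or before that date count as old.
     */
    static List<Date> newerThan(Date latestStored, List<Date> addedDates) {
        List<Date> fresh = new ArrayList<Date>();
        for (Date added : addedDates) {
            if (added.after(latestStored)) {
                fresh.add(added);
            }
        }
        return fresh;
    }
}
```

With a check like this, a fetch that returns only days-old rows yields an empty list, which matches the runs where nothing gets inserted.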

Now for the weird part: about 50% of the time this works just fine, ignoring the old data and inserting the new rows. The other 50% of the time, though, nothing gets added to the database, and my logs indicate that it's processing data from several days ago! The date of this old data closely coincides with the date and time I first got this system working in its present form (it was a standalone Java app before this), so I suspect something is being cached somewhere. But what's weird is that the pattern isn't A, A, B, B, C, C (the previous fetch cached and returned again); it's A, B, C, A, D, A, E, ... (always the very first data set).

  • Could it be from OpenCSV? Since I'm constructing a new object with every run of the EJB, I don't see how.
  • Could it be Glassfish v3, the application server I'm using? The domain has been restarted a bunch of times and even recreated since the date of this old data.
  • I have a DNS cache on my network but there are no other proxies between me and the remote server.
  • Could the HTTP download of the data be timing out, with old data substituted?
  • Could the server be sending a 304 Not Modified response, causing something to return the originally fetched copy?
  • It's possible there's an issue with the server I'm fetching the data from, but considering that I'm getting the very first data set back that this program ever fetched I don't think so.
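
To narrow down the timeout and Not Modified theories, one option is to log the response metadata on every run and ask any intermediate cache to revalidate. The following is only a diagnostic sketch, not a fix: the class `FetchDiagnostics` and its methods `summarize` and `probe` are hypothetical names, and whether the remote server or a cache honors the `Cache-Control` and `Pragma` request headers is an assumption.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchDiagnostics {
    /** Builds one log line from the response status and cache-related headers. */
    static String summarize(int status, String date, String lastModified, String age) {
        return "status=" + status + " date=" + date
                + " lastModified=" + lastModified + " age=" + age;
    }

    /** Requests the URL with caching discouraged and returns the summary line. */
    static String probe(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setConnectTimeout(15 * 1000);
        conn.setReadTimeout(15 * 1000);
        conn.setUseCaches(false);
        // Ask caches along the way to revalidate instead of replaying a stored copy.
        conn.setRequestProperty("Cache-Control", "no-cache");
        conn.setRequestProperty("Pragma", "no-cache");
        try {
            // getResponseCode() sends the request; a 304 or a non-null Age
            // header would point at a cached response.
            return summarize(conn.getResponseCode(),
                    conn.getHeaderField("Date"),
                    conn.getHeaderField("Last-Modified"),
                    conn.getHeaderField("Age"));
        } finally {
            conn.disconnect();
        }
    }
}
```

Logging the result of `probe(new URL("someurl"))` on each timer run should show whether the stale runs differ from the good ones in status code or headers. A read timeout would normally surface as a `java.net.SocketTimeoutException` in the catch block rather than as silently substituted data.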