views:

160

answers:

2

This is a general design question about how to make a web application that will receive a large amount of uploaded data, process it, and return a result, all without the dreaded spinning beach-ball for 5 minutes or a possible HTTP timeout.

Here are the requirements:

  • make a web form where you can upload a CSV file containing a list of URLs
  • when the user clicks "submit", the server fetches the file and checks each URL to see if it's alive and what the title tag of the page is.
  • the result is a downloadable CSV file containing the URL and the resulting HTTP code
  • the input CSV can be very large (> 100,000 rows), so the fetch process might take 5-30 minutes.

My solution so far is to have a polling JavaScript loop on the client side, which queries the server every second to determine the overall progress of the job. This seems kludgy to me, and I'm hesitant to accept it as the best solution.

I'm using Perl, Template Toolkit, and jQuery, but a solution using any web technology would be acceptable.

edit: An example of a possible solution is in this question: http://stackoverflow.com/questions/333664/simple-long-polling-example-code

+3  A: 

You can do this with AJAX, but you may get better real-time results with a Comet-style implementation. I believe Comet implementations are specifically designed to get around some of these timeout limitations, but I haven't used any, so I can't offer a direct guide.

Either way, my recommendation is to hand the work off to another process once it reaches the server.

I've worked on a number of different solutions for batch tasks of this nature, and the one I like best is to hand the batch work off to another process. In such a system, the upload page hands the work to a separate processor and returns immediately with instructions for the user on how to monitor the process.

The batch processor can be implemented in a couple of ways:

  • Fork a child, detach it from I/O, and let it complete the batch processing while the parent finishes the web request.
  • Save the upload content to a processing queue (e.g. a file on the file system, or records in a database) and have the web server notify an external processor, either a custom daemon or an off-the-shelf scheduler like "at" on *nix systems.
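The fork-and-detach variant might look like the sketch below. It's in Python for illustration (the asker would do the equivalent in Perl with fork and POSIX::setsid); run_batch and the file paths are placeholder names, and the double fork is the usual trick for fully orphaning the worker:

```python
import os

def run_batch(csv_path, status_path):
    # Placeholder for the real work: fetch every URL listed in csv_path
    # and record the HTTP codes. Here we just mark the job as done.
    with open(status_path, "w") as f:
        f.write("complete\n")

def handle_upload(csv_path, status_path):
    """Hand the batch job to a detached grandchild; the web request returns at once."""
    pid = os.fork()
    if pid > 0:
        os.waitpid(pid, 0)       # parent: reap the intermediate child, finish the request
        return
    os.setsid()                  # first child: start a new session, detaching from the parent
    if os.fork() > 0:
        os._exit(0)              # first child exits; the grandchild is re-parented to init
    devnull = os.open(os.devnull, os.O_RDWR)
    for fd in (0, 1, 2):         # grandchild: redirect inherited stdio so the web
        os.dup2(devnull, fd)     # server's sockets aren't held open
    run_batch(csv_path, status_path)
    os._exit(0)
```

The second bullet (queue plus external daemon) avoids the forking subtleties entirely, at the cost of running one more service.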

You can then offer the user multiple ways to monitor the process:

  • The upload confirmation page contains a synchronous live monitor of the batch process (via Comet or Flash). When the job completes, the confirmation page can direct the user to their download.
  • As above, but the monitor is not live; instead it polls periodically via AJAX or a page meta refresh.
  • A queue-monitor page that shows the user the status of any batch processes they have running.

The batch processor can communicate its status in a number of ways:

  • Update a record in the database
  • Generate a processing log
  • Use a named pipe
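As one concrete sketch of the "update a record" option (a database row would work the same way), the processor could keep a small JSON status record that the monitor page reads on every poll. This is illustrative Python; the names are hypothetical:

```python
import json
import os

def write_status(path, done, total):
    """Called by the batch processor after each chunk of work."""
    # Write to a temp file and rename so the monitor never sees a
    # half-written record (os.replace is atomic on POSIX).
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"done": done, "total": total,
                   "complete": done >= total}, f)
    os.replace(tmp, path)

def read_status(path):
    """Called by the monitor page (e.g. from an AJAX-polled endpoint)."""
    if not os.path.exists(path):
        return {"done": 0, "total": 0, "complete": False}
    with open(path) as f:
        return json.load(f)
```

Because the record lives outside both the web server and the browser, the monitor can detach and re-attach at any time, which is the property the benefits below depend on.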

There are a number of benefits to handing the code off to another process:

  • The process will continue even if the user accidentally closes the browser.
  • Using an external process forces you to communicate batch status in a way that lets you detach your monitor and re-attach at any time, e.g. when a user accidentally navigates away from the page before the process is complete.
  • It's easier to implement batch throttling and postponement if you decide you need to spread out your batch processing to occur during low web traffic hours.
  • You don't have to worry about web timeouts (either client side or server side).
  • You can restart the web server without worrying about whether you're interrupting a batch process.
benrifkah
I was hoping to avoid a polling method, but it's looking like that is not possible without using Flash or other bytecode.
David Dombrowsky
+1  A: 

The simplest approach would be to batch-process, or even stream, the job. Treat it like a data table on your page: if the table had > 100,000 records, you wouldn't request all of them at once. I would do this:

  1. Send a request to download file.

  2. Send a request to process 100 (arbitrary number) records.

    a. Process records.

    b. Save to a temporary CSV file.

    c. Respond with a status of complete / not complete.

    d. If the status is not complete, repeat step 2.
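Steps 2a-2d might be sketched as follows, in Python for illustration (the asker would write the same loop in Perl); check_url, the chunk size, and run are placeholders, and the real client would drive the repeat-until-complete loop one HTTP request at a time:

```python
import csv

CHUNK = 100  # arbitrary batch size, as in step 2

def check_url(url):
    # Placeholder: the real version would issue an HTTP request and
    # return the status code (and the page's title tag).
    return 200

def process_chunk(urls, offset, out_path):
    """Process one slice of the job (2a), append results to the temp
    CSV (2b), and return the status the caller uses to decide (2c)."""
    batch = urls[offset:offset + CHUNK]
    with open(out_path, "a", newline="") as f:
        writer = csv.writer(f)
        for url in batch:
            writer.writerow([url, check_url(url)])
    done = offset + len(batch) >= len(urls)
    return {"complete": done, "next_offset": offset + len(batch)}

def run(urls, out_path):
    # 2d: repeat until the status comes back complete. In the real
    # application each iteration would be a separate client request.
    offset = 0
    while True:
        status = process_chunk(urls, offset, out_path)
        if status["complete"]:
            break
        offset = status["next_offset"]
```

Each request stays well under any HTTP timeout, and the "not complete" responses double as progress updates for the user.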

Gutzofter