We use SharpZipLib. We need to be able to unzip files on the server and place them in a separate folder. The request to unzip a file will come from a user on a web page. I imagine that if the files are large enough, unzipping will take a long time. We don't want users to be stuck on the page waiting for the unzip to complete before they can continue browsing the site.

What is a good way to handle this scenario: spin off a separate thread to do the unzipping, create a separate Windows service that unzips files, or something else?

What are the pros and cons of doing it via a separate thread versus a Windows service?

A: 

Personally, I'd go down the Windows service route, with messaging between the web application and the service for progress. For example, the service could return a handle to the unzip job, which can then be used to monitor its status.

However, I think you could also spin off a thread to do it; the thread will happily keep executing after the page returns.

Lloyd
What are the pros and cons of doing it via a separate thread versus a Windows service?
dev.e.loper
A: 

I would use an asynchronous process that you can easily poll from an AJAX-enabled page. When it completes, the AJAX portion of the page can present the details you would normally have presented had the user waited for the process to complete synchronously.
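
A minimal sketch of that idea in C#, assuming an in-process thread-pool task rather than a separate service. `UnzipJobs`, `JobState`, `Start`, and `GetStatus` are hypothetical names, and the `work` delegate stands in for the actual extraction call (for instance, one made through SharpZipLib):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public enum JobState { Queued, Running, Succeeded, Failed }

// Hypothetical in-memory job registry; all names here are illustrative.
public static class UnzipJobs
{
    private static readonly ConcurrentDictionary<Guid, JobState> States =
        new ConcurrentDictionary<Guid, JobState>();

    // Queues 'work' on the thread pool and returns a handle immediately,
    // so the page request does not wait for the extraction to finish.
    public static Guid Start(Action work)
    {
        Guid id = Guid.NewGuid();
        States[id] = JobState.Queued;
        Task.Run(() =>
        {
            States[id] = JobState.Running;
            try
            {
                work();
                States[id] = JobState.Succeeded;
            }
            catch (Exception)
            {
                // A real implementation would log the exception here.
                States[id] = JobState.Failed;
            }
        });
        return id;
    }

    // Called by the polling endpoint; returns null for an unknown handle.
    public static JobState? GetStatus(Guid id)
    {
        JobState state;
        return States.TryGetValue(id, out state) ? state : (JobState?)null;
    }
}
```

The page would call `UnzipJobs.Start(...)`, hand the returned `Guid` to the browser, and the AJAX poll would call `GetStatus` with it. Because the state lives in memory, it is lost if the application restarts.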

Wayne Hartman
A: 

Advantages of a separate process
Work done in a separate process can be decoupled from the page flow in three ways: in time, physically, and in security. Decoupled in time: if you choose, you can buffer the unzip requests until "later", when load is lower and you have spare CPU cycles to do the work.

Also decoupled physically; for a large scale system, you could have multiple worker processes, even deployed on multiple independent machines, doing this work asynchronously, and that layer of processing can scale independently of the web page processing. In any system there are bottlenecks, and the advantage of distributed deployments is you can scale the separate workloads independently, to more efficiently eliminate bottlenecks.

I would say, though, that this latter benefit is only useful in very large-scale systems. In most cases you won't have the kind of transaction volume that would benefit from an independent physical scaling layer. This is true not just of your workload, but of 98% of all workloads. The YAGNI principle applies to scalability, too.

Physical decoupling also allows the disparate workloads (page flow and zip unpack) to be developed independently. For example, suppose the workitem were not a simple "unzip a file" but something more complex, with multiple steps and decision points along the way. Designing the work processor as a separate process allows the page flow to be built and tested independently from the workitem processing. This can be a nice advantage if they have to evolve independently.

This physical decoupling is also nice if workitems will arrive via different channels. Suppose the web page is not the only way for a workitem to arrive. Suppose you have an FTP drop, a web service, or a machine-monitored email box that can also receive workitems. In that case it would make sense to have the workitem processing physically decoupled from the web page processing.

Finally, these things are decoupled in security at runtime. In some web app server deployments, security rules prohibit the web server from writing to the disk - web servers have no writable disk storage. A separate async worker process can be deployed in a separate part of the network, with plenty of storage, and perhaps constrained by a separate set of security requirements. This may or may not be applicable to you.

Advantages of Threaded processing
The advantage of doing the work in a separate thread is that it is much simpler. Decoupling introduces complexity and cost. When you manage the work in a separate thread, you have none of the operational overhead of managing a separate process, potentially on a separate machine. There's no additional configuration, no new build/deployment step. No additional backup. No additional security identity to maintain. No communication interchange to worry about (beyond the thread dispatch).

You could choose to get a little more sophisticated about workitem processing, and optionally do the work synchronously when the zipfile looks small enough. Suppose you establish a threshold of 4 seconds response time: above that, you process the workitem asynchronously; below it, you do the work "inline". Of course you never know for sure how long a zipfile will take, but you could establish a good heuristic based on the size of the file. This optimization is available to you whether you use an external process for async work or a separate thread, but to be honest, it is simpler to take advantage of it when using a separate thread, because there is less additional work to do. So this is an advantage for the threaded approach.
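
A rough sketch of such a size-based heuristic in C#. The 4-second threshold comes from the paragraph above; the throughput figure and the names `UnzipPolicy`, `EstimateDuration`, and `ShouldRunInline` are assumptions for illustration and would need calibrating against your own server:

```csharp
using System;

// Hypothetical policy class; names and the throughput figure are illustrative.
public static class UnzipPolicy
{
    // Assumed extraction throughput (50 MB/s). This is a guess, not a measurement.
    public const double AssumedBytesPerSecond = 50.0 * 1024 * 1024;

    // Response-time threshold from the discussion above.
    public static readonly TimeSpan InlineThreshold = TimeSpan.FromSeconds(4);

    // Estimates extraction time from the archive size alone.
    public static TimeSpan EstimateDuration(long zipSizeBytes)
    {
        if (zipSizeBytes < 0) throw new ArgumentOutOfRangeException("zipSizeBytes");
        return TimeSpan.FromSeconds(zipSizeBytes / AssumedBytesPerSecond);
    }

    // True when the estimate is under the threshold, so the unzip can run
    // inside the page request; false means hand it to a background worker.
    public static bool ShouldRunInline(long zipSizeBytes)
    {
        return EstimateDuration(zipSizeBytes) < InlineThreshold;
    }
}
```

The size would typically come from something like `new System.IO.FileInfo(zipPath).Length`. Compressed size is only a proxy for extraction time, so the estimate will be off for archives with unusual compression ratios.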

Non Differentiators
If you choose to have an AJAX polling mechanism for notification of workitem status, that would work with either the separate process or the separate thread. I don't know how you would do work item tracking, but I would suppose that when a particular work item (zip file?) is completed, then you will update a record somewhere - a file in a filesystem, a table in a database. That update happens whether it is being done by a thread in the same process, or by a separate process (Windows Service). So the AJAX client that polls will just check the db table or filesystem in any case, and will get the notification of workitem status in the same way, regardless of your architecture decision.
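
To illustrate why the status check is architecture-neutral, here is a small C# sketch of the "file in a filesystem" option. It assumes one status file per workitem in a directory that both the worker and the web application can reach, and a single writer per workitem; `StatusFiles`, `Write`, `Read`, and the `.status` extension are hypothetical names:

```csharp
using System;
using System.IO;

// Hypothetical file-based status store; works the same whether the writer
// is a thread in the web process or a separate Windows service.
public static class StatusFiles
{
    private static string PathFor(string statusDir, Guid jobId)
    {
        return Path.Combine(statusDir, jobId.ToString("N") + ".status");
    }

    // Worker side: write to a temp file, then swap it into place, so a
    // poller does not see a partially written status.
    public static void Write(string statusDir, Guid jobId, string status)
    {
        string target = PathFor(statusDir, jobId);
        string temp = target + ".tmp";
        File.WriteAllText(temp, status);
        if (File.Exists(target)) File.Replace(temp, target, null);
        else File.Move(temp, target);
    }

    // Polling side: returns null when no status has been recorded yet.
    public static string Read(string statusDir, Guid jobId)
    {
        try { return File.ReadAllText(PathFor(statusDir, jobId)); }
        catch (FileNotFoundException) { return null; }
        catch (DirectoryNotFoundException) { return null; }
    }
}
```

The AJAX endpoint only ever calls `Read`, so it does not need to know which kind of worker called `Write`. A database table would serve the same purpose.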

How to decide
The theory is interesting but ultimately useless without actual operating constraints.

Workload is one of the key real-world items. You didn't say how large these zip files are, but I am guessing they are "regular sized": something around 4 GB or less. Normally a zipfile like that takes 20-60 seconds to unpack on my laptop, but of course on a server with a real storage system and a faster CPU, it will take less. You also did not characterize the concurrency of transactions - how many of these things will be happening at any one time. I'm assuming concurrency is not particularly high.

If that is the case, I would stick to the simpler async thread approach. You are doing this in ASP.NET, I presume on a server OS. The CLR has good thread management, and ASP.NET has good process scale-out capability. So even in high workloads, you will get good CPU utilization and scale, without a ton of configuration effort.

If the workitems were longer running - let's say on the order of hours or even days, and the time was unpredictable (like the closing of a stock order) - well, in that case I would lean toward an async process. If the concurrency was in the thousands per second, or again very unpredictable, that would also argue for a separate process. If the failure modes were complex enough, I might want the workitems to be in a separate process just to manage them. If the workitem processing were likely to change regularly (adding an additional step, according to evolving business conditions), I might want it in a separate process.

But none of those things seem to be true in your case - unpacking zip files.

Cheeso
A: 

The disadvantages of a separate thread are:

  1. Once the page request ends, there is no easy way to get notified of what the other thread is doing.
  2. The application could be restarted at any point, taking the thread down in the middle of an unzip.
  3. It would be easy to accidentally start off the process twice if the user submits the page twice in quick succession.
  4. Multithreaded code is hard to debug.
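
Point 3 can be mitigated with a small in-flight registry. The following C# sketch assumes a single web process and uses the zip file path as the key; `InFlightUnzips`, `TryBegin`, and `End` are hypothetical names:

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical guard against starting the same unzip twice.
// It only protects within one process, not across several worker processes.
public static class InFlightUnzips
{
    // Case-insensitive keys, on the assumption of Windows-style paths.
    private static readonly ConcurrentDictionary<string, bool> Active =
        new ConcurrentDictionary<string, bool>(StringComparer.OrdinalIgnoreCase);

    // Returns true only for the first caller; a repeated submit for the
    // same zip path gets false until End is called.
    public static bool TryBegin(string zipPath)
    {
        return Active.TryAdd(zipPath, true);
    }

    // Call from a finally block when the unzip finishes or fails.
    public static void End(string zipPath)
    {
        bool ignored;
        Active.TryRemove(zipPath, out ignored);
    }
}
```

The page handler would start the background work only when `TryBegin` returns true, and the worker would call `End` in a `finally` block so that a failed unzip can be retried.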

The advantages of a separate thread are:

  1. Less code
  2. Easy to do fire-and-forget if the user doesn't need to be notified when the unzip completes.
  3. No extra work to install.

The advantages and disadvantages of a Windows service are roughly the opposite of the above.

Jonathan Parker