views: 590
answers: 7

I'm working on a web service at the moment, and there is the potential that the returned results could be quite large (>5 MB).

It's perfectly valid for this set of data to be this large, and the web service can be called either synchronously or asynchronously, but I'm wondering what people's thoughts are on the following:

  1. If the connection is lost, the entire result set will have to be regenerated and sent again. Is there any way I can do any sort of "resume" if the connection is lost or reset?

  2. Is sending a result set this large even appropriate? Would it be better to implement some sort of "paging" where the result set is generated and stored on the server, and the client can then download chunks of it in smaller amounts and re-assemble the set at their end?

+2  A: 

There's no hard law against 5 MB as a result set size, but over 400 MB can be hard to send.

You'll get async handlers automatically (since you're using .NET).
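
For illustration, here's roughly what the asynchronous path looks like from the client side, assuming a proxy class generated by "Add Web Reference" / wsdl.exe. The ReportService proxy, the GetResults method, and the Result type are hypothetical stand-ins for whatever your own service exposes:

    // Hedged sketch: wsdl.exe generates a Begin/End pair like this for
    // each ASMX web method. "ReportService", "GetResults", and "Result"
    // are hypothetical names standing in for your actual proxy.
    ReportService svc = new ReportService();

    // Start the call without blocking the calling thread.
    IAsyncResult ar = svc.BeginGetResults(null, null);

    // ... do other work while the (possibly large) result is in flight ...

    // Block only when you actually need the data.
    Result[] results = svc.EndGetResults(ar);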

implement some sort of "paging" where the result set is generated and stored on the server, and the client can then download chunks of it in smaller amounts and re-assemble the set at their end

That's already happening for you -- it's called TCP/IP ;-) Re-implementing that could be overkill.

Similarly --

entire result set will have to be regenerated and sent again

If it's MS SQL Server, for example, that is generating most of the result set, then re-generating it will take advantage of some implicit caching in SQL Server, and subsequent generations will be quicker.

To some extent you can get away with not worrying about these problems, until they surface as 'real' problems -- because the platform(s) you're using take care of a lot of the performance bottlenecks for you.

Leon Bambrick
A: 

I somewhat disagree with secretGeek's comment:

That's already happening for you -- it's called tcp/ip ;-) Re-implementing that could be overkill.

There are times when you may want to do just this, but really only from a UI perspective. If you implement some way to either stream the data to the client (via something like a pushlets mechanism) or chunk it into pages as you suggest, you can then load a really small subset on the client and then slowly build up the UI with the full amount of data.

This makes for a slicker, speedier UI (from the user's perspective), but you have to evaluate whether the extra effort will be worthwhile, because I don't think it will be an insignificant amount of work.
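
To make that concrete, here is a minimal sketch of the chunk-and-build-up approach in a WinForms client. The service proxy's GetPage method, the grid's AddRows helper, and totalPages are all hypothetical; the point is only the shape of the pattern:

    // Hedged sketch: show the first chunk immediately, then pull the
    // rest on a background thread. "service.GetPage", "grid.AddRows",
    // and "totalPages" are hypothetical names.
    Row[] firstPage = service.GetPage(0, 50);   // small and fast
    grid.AddRows(firstPage);                    // UI renders something now

    ThreadPool.QueueUserWorkItem(delegate
    {
        for (int page = 1; page < totalPages; page++)
        {
            Row[] rows = service.GetPage(page, 50);
            // Marshal back to the UI thread before touching controls.
            grid.Invoke((MethodInvoker)delegate { grid.AddRows(rows); });
        }
    });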

Mike Stone
A: 

To some extent you can get away with not worrying about these problems, until they surface as 'real' problems

I like this solution :) It reminds me of the quote that I think Paul Graham is always throwing around when talking to startups:

"Scale isn't your problem. Getting people to give a *@#$ is"

and whilst this is somewhat true at this stage, I was wondering more broadly whether there are better ways to deal with transmitting large result sets via web services, or if the general consensus is to see what happens... if it fails... suck it up and try again.

lomaxx
A: 

Mike, it's an interesting point you make about providing a small chunk up front and then providing more chunks at a later stage. In a client application like the one we are building, I can definitely see some real benefit there.

It does mean more work on the client and the server to cater for this, but at the same time, it sure is better than waiting a minute or so for the full result set to download.

lomaxx
A: 

So it sounds like you'd be interested in a solution that adds 'starting record number' and 'final record number' parameters to your web method (or 'page number' and 'results per page').

This shouldn't be too hard if the backing store is SQL Server (or even MySQL), as both have built-in support for paging (ROW_NUMBER() in SQL Server 2005 and later; LIMIT/OFFSET in MySQL).

Even so, you should be able to avoid doing any session management on the server, avoid any explicit caching of the result set, and just rely on the backing store's caching to keep your life simple.
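
As a rough sketch of what such a web method might look like against SQL Server's ROW_NUMBER(). The Results table, its columns, and the connectionString field are hypothetical:

    [WebMethod]
    public DataSet GetPage(int startRow, int endRow)
    {
        // ROW_NUMBER() numbers the rows inside the subquery so we can
        // slice out just the requested window. "Results" and its
        // columns are hypothetical.
        const string sql = @"
            SELECT Id, Name, Value
            FROM (SELECT Id, Name, Value,
                         ROW_NUMBER() OVER (ORDER BY Id) AS RowNum
                  FROM Results) AS numbered
            WHERE RowNum BETWEEN @start AND @end;";

        // "connectionString" is assumed to be defined elsewhere.
        using (SqlConnection conn = new SqlConnection(connectionString))
        using (SqlDataAdapter da = new SqlDataAdapter(sql, conn))
        {
            da.SelectCommand.Parameters.AddWithValue("@start", startRow);
            da.SelectCommand.Parameters.AddWithValue("@end", endRow);
            DataSet ds = new DataSet();
            da.Fill(ds);   // adapter opens and closes the connection itself
            return ds;
        }
    }

Because each call is self-contained, there is no server-side session to manage, and repeated page requests get to lean on SQL Server's caching, per the point above.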

Leon Bambrick
A: 

The first question you need to ask yourself is how fast the network between your client and server is. If both are inside the same network, and it is your standard corporate LAN, then you are likely looking at 100 Mbit/s. Most people aren't even going to notice the network delay in these cases, so I would say it is nothing to worry about.

.NET should throw an exception if something goes wrong; once again, network speeds come into play here, as you might be able to just re-download the data without the user noticing.

You might want to examine the data to make sure you are only getting the data that you need, but if you need all of the data to do the task at hand, I don't see why there would be an issue with the size. You might also want to look into compression to see if there are any ways to make it a bit smaller if size and speed are a major issue.
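
On the compression point, a minimal sketch using System.IO.Compression's GZipStream (available since .NET 2.0) to shrink a serialized payload before returning it as a byte[], assuming the client decompresses on its end (needs using System.IO, System.IO.Compression, and System.Text):

    // Compress a serialized result (e.g. the XML of a DataSet) before
    // sending it over the wire. The client must gunzip on arrival.
    static byte[] Compress(string serialized)
    {
        byte[] raw = Encoding.UTF8.GetBytes(serialized);
        using (MemoryStream ms = new MemoryStream())
        {
            using (GZipStream gz = new GZipStream(ms, CompressionMode.Compress))
            {
                gz.Write(raw, 0, raw.Length);
            }   // disposing the GZipStream flushes the final gzip block
            return ms.ToArray();
        }
    }

XML in particular tends to compress well, so this can take a 5 MB payload down considerably.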

Rob
+1  A: 

I have seen all three approaches: paged, store-and-retrieve, and massive push.

I think the solution to your problem depends to some extent on why your result set is so large and how it is generated. Do your results grow over time? Are they calculated all at once and then pushed? Do you want to stream them back as soon as you have them?

In my experience, a paging approach is appropriate when the client needs quick access to reasonably sized chunks of the result set, similar to pages of search results. Considerations here are the overall chattiness of your protocol, caching of the entire result set between client page requests, and/or the processing time it takes to generate a page of results.

Store-and-retrieve is useful when the results are not random-access and the result set grows in size as the query is processed. Issues to consider here are complexity for clients, whether you can provide the user with partial results, and whether you need to calculate all results before returning anything to the client (think sorting of results from distributed search engines).

The massive push approach is almost certainly flawed. Even if the client needs all of the information and it needs to be pushed in a monolithic result set, I would recommend taking the approach of WS-ReliableMessaging (either directly or through your own simplified version) and chunking your results. By doing this you 1) ensure that the pieces reach the client, 2) can discard each chunk as soon as you get a receipt from the client, and 3) reduce the possible memory-consumption issues of having to retain 5 MB of XML, DOM, or whatever in memory (assuming you aren't processing the results in a streaming manner) on both the server and client sides.
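
A hedged sketch of what a hand-rolled version of that chunk-and-acknowledge idea could look like as a pair of methods inside an ASMX WebService class. All names are hypothetical, this is one possible shape rather than WS-ReliableMessaging itself, and how the chunks get into the store when the result is generated is elided:

    // resultId -> remaining chunks, keyed by chunk index. Chunks are
    // dropped as soon as the client acknowledges them, capping server
    // memory use. (Hypothetical sketch, not WS-ReliableMessaging.)
    private static readonly Dictionary<string, Dictionary<int, byte[]>> pending =
        new Dictionary<string, Dictionary<int, byte[]>>();

    [WebMethod]
    public byte[] GetChunk(string resultId, int chunkIndex)
    {
        lock (pending) { return pending[resultId][chunkIndex]; }
    }

    [WebMethod]
    public void AckChunk(string resultId, int chunkIndex)
    {
        lock (pending)
        {
            pending[resultId].Remove(chunkIndex);   // receipt: free it
            if (pending[resultId].Count == 0)
                pending.Remove(resultId);           // whole set delivered
        }
    }

With this shape, a lost connection costs only the unacknowledged chunks, which also goes some way toward the "resume" behaviour asked about in the original question.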

Like others have said, though, don't do anything until you know that your result set size, how it is generated, and overall performance are actual issues.

DavidValeri