I'm designing a multi-tiered database driven web application – SQL relational database, Java for the middle service tier, web for the UI. The language doesn't really matter.

The middle service tier performs the actual querying of the database. The UI simply asks for certain data and has no concept that it's backed by a database.

The question is how to handle large data sets? The UI asks for data but the results might be huge, possibly too big to fit in memory. For example, a street sign application might have a service layer of:

StreetSign getStreetSign(int identifier)
Collection<StreetSign> getStreetSigns(Street street)
Collection<StreetSign> getStreetSigns(LatLonBox box)

The UI layer asks to get all street signs meeting some criteria. Depending on the criteria, the result set might be huge. The UI layer might divide the results into separate pages (for a browser) or present them all at once (serving them up to Google Earth). The potentially huge result set could be a performance and resource problem (out of memory).

One solution is not to return fully loaded objects (StreetSign objects). Rather return some sort of result set or iterator that lazily loads each individual object.
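A minimal sketch of that lazy-loading idea: an Iterator that fetches one page at a time from a backing store, so the full result set never has to fit in memory. All of the names here (PageFetcher, LazySignIterator, the page size) are illustrative, not from any framework.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class StreetSign {
    final int id;
    StreetSign(int id) { this.id = id; }
}

// Abstraction over the real data store: returns up to pageSize signs
// starting at offset, or an empty list when the results are exhausted.
interface PageFetcher {
    List<StreetSign> fetchPage(int offset, int pageSize);
}

// Iterator that keeps only one page in memory at a time.
class LazySignIterator implements Iterator<StreetSign> {
    private final PageFetcher fetcher;
    private final int pageSize;
    private int offset = 0;
    private List<StreetSign> buffer = new ArrayList<>();
    private int pos = 0;

    LazySignIterator(PageFetcher fetcher, int pageSize) {
        this.fetcher = fetcher;
        this.pageSize = pageSize;
    }

    public boolean hasNext() {
        if (pos < buffer.size()) return true;
        buffer = fetcher.fetchPage(offset, pageSize);  // load the next page on demand
        offset += buffer.size();
        pos = 0;
        return !buffer.isEmpty();
    }

    public StreetSign next() {
        if (!hasNext()) throw new NoSuchElementException();
        return buffer.get(pos++);
    }
}
```

The UI can then iterate over the whole logical result set while the service tier only ever materializes one page.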

Another solution is to change the service API to return a subset of the requested data:

Collection<StreetSign> getStreetSigns(LatLonBox box, int pageNumber, int resultsPerPage)

Of course the UI can still request a huge result set:

getStreetSigns(box, 1, 1000000000)

I'm curious what is the standard industry design pattern for this scenario?

A: 

In ASP.NET I would use server-side paging, where you retrieve only the page of data the user has requested from the data store, as opposed to retrieving the entire result set, putting it into memory, and paging through it on request.

Ty
A: 

JSF (JavaServer Faces) has widgets for chunking large result sets to the browser. It can be parameterized as you suggest. I wouldn't call it a "standard industry design pattern" by any means, but it is worth a look to see how someone else solved the problem.

dacracot
+1  A: 

I would say if the potential exists for a large set of data, then go the paging route.

You can still set a MAX that you do not want them to go over.

E.g., SO uses page sizes of 15, 30, 50...
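That MAX can be enforced in the service layer with a simple clamp, so a call like getStreetSigns(box, 1, 1000000000) can never exhaust memory. A sketch (the limit and the class name are illustrative):

```java
// Server-side paging policy: the caller can ask for any page size, but the
// service never honors more than MAX_PAGE_SIZE rows per request.
class PagingPolicy {
    static final int MAX_PAGE_SIZE = 50;   // illustrative limit

    static int clampPageSize(int requested) {
        if (requested < 1) return 1;       // guard against nonsense input
        return Math.min(requested, MAX_PAGE_SIZE);
    }
}
```

The service tier would call clampPageSize() on the resultsPerPage parameter before building the query.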

Brian Schmitt
A: 

When I deal with this type of issue, I usually chunk the data sent to the browser (or thin/thick client, whichever is more appropriate for your situation). Regardless of the total size of the data that meets some criteria, only a small portion is really usable in any UI at one time.

I live in a Microsoft world, so my primary environment is ASP.Net with SQL Server. Here are two articles about paging (which mention some techniques for paging through result sets) that may be helpful:

Paging through lots of data efficiently (and in an Ajax way) with ASP.NET 2.0

Efficient Data Paging with the ASP.NET 2.0 DataList Control and ObjectDataSource

Another mechanism that Microsoft has shipped lately is their idea of "Dynamic Data" - you might be able to check out the guts of this for some guidance as to how they're dealing with this issue.

Gunny
A: 

I've done similar things on two different products. In one case the data source is optionally paginated; in Java, it implements a Pageable interface similar to:

public interface Pageable
{
    public void setStartIndex( int index );
    public int getStartIndex();
    public int getRowsPerPage() throws Exception;
    public void setRowsPerPage( int rowsPerPage );
}

The data source implements another method to get() items, and a paginated implementation simply returns the current page. So you can set your start index and grab a page in your controller.
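A paginated data source implementing that interface might look like the following sketch. It is backed by an in-memory list for clarity, where a real implementation would issue a limited query; PagedListSource is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// The Pageable interface from above, repeated so the sketch is self-contained.
interface Pageable
{
    public void setStartIndex( int index );
    public int getStartIndex();
    public int getRowsPerPage() throws Exception;
    public void setRowsPerPage( int rowsPerPage );
}

// Illustrative paginated data source backed by an in-memory list; a real one
// would translate startIndex/rowsPerPage into a limited database query.
class PagedListSource<T> implements Pageable
{
    private final List<T> rows;
    private int startIndex = 0;
    private int rowsPerPage = 10;

    PagedListSource( List<T> rows ) { this.rows = rows; }

    public void setStartIndex( int index ) { this.startIndex = index; }
    public int getStartIndex() { return startIndex; }
    public int getRowsPerPage() { return rowsPerPage; }
    public void setRowsPerPage( int rowsPerPage ) { this.rowsPerPage = rowsPerPage; }

    // The separate get() method mentioned above: return only the current page.
    public List<T> get()
    {
        int from = Math.min( startIndex, rows.size() );
        int to = Math.min( startIndex + rowsPerPage, rows.size() );
        return rows.subList( from, to );
    }
}
```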

One thing to consider is caching your cursors server-side. For a web app you'll have to expire them, but they really help performance-wise.

Niniki
A: 

The Fedora digital repository project returns a maximum number of results along with a result-set-id. You then get the rest of the results by asking for the next chunk, supplying the result-set-id in the subsequent query. It works OK as long as you don't need any searching or sorting outside of the query.
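The result-set-id idea can be sketched in Java like this. All names and the in-memory storage are illustrative; a real server would run the query once, keep only the matching ids, and also expire stale result sets.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

// Server-side registry: the first call registers the full list of matching ids
// and hands back a token; later calls supply the token to get the next chunk.
class ResultSetRegistry {
    private final Map<Long, List<Integer>> results = new HashMap<>();
    private final Map<Long, Integer> cursors = new HashMap<>();
    private final AtomicLong nextId = new AtomicLong(1);

    long register(List<Integer> matchingIds) {
        long token = nextId.getAndIncrement();
        results.put(token, matchingIds);
        cursors.put(token, 0);               // cursor starts at the beginning
        return token;
    }

    // Returns the next chunk for this result set, advancing the cursor;
    // an empty list means the result set is exhausted.
    List<Integer> nextChunk(long token, int chunkSize) {
        List<Integer> ids = results.get(token);
        int pos = cursors.get(token);
        int to = Math.min(pos + chunkSize, ids.size());
        cursors.put(token, to);
        return new ArrayList<>(ids.subList(pos, to));
    }
}
```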

MattSmith
+1  A: 

The most frequent pattern I've seen for this situation is some sort of paging, usually done server-side to reduce the amount of information sent over the wire.

Here's a SQL Server 2000 example using a table variable (generally faster than a temp table) together with your street signs example:

CREATE PROCEDURE GetPagedStreetSigns
(
  @Page int = 1,
  @PageSize int = 10
)
AS
  SET NOCOUNT ON

  -- This table variable will control paging
  DECLARE @TempTable TABLE (RowNumber int identity, StreetSignId int)

  INSERT INTO @TempTable
  (
     StreetSignId
  )
  SELECT [Id]
  FROM   StreetSign
  ORDER BY [Id]

  -- select only those rows belonging to the requested page
  SELECT SS.*
  FROM   StreetSign SS
         INNER JOIN @TempTable TT ON TT.StreetSignId = SS.[Id]
  WHERE  TT.RowNumber BETWEEN ((@Page - 1) * @PageSize + 1) 
                      AND (@Page * @PageSize)

In SQL Server 2005, you can get more clever with stuff like Common Table Expressions and the new SQL Ranking functions. But the general theme is that you use the server to return only the information belonging to the current page.

Be aware that this approach can get messy if you're allowing the end-user to apply on-the-fly filters to the data that s/he's seeing.

RoadWarrior
+5  A: 

The very first question should be:

Does the user need to, or is the user even able to, manage this amount of data?

Although the result set should be paged, if its potential size is that huge the answer will probably be "no", so the UI shouldn't try to show it.

I worked on J2EE projects for health-care systems that dealt with enormous amounts of stored data, literally millions of patients, visits, forms, etc., and the general rule was not to show more than 100 or 200 rows for any user search, advising the user that that set of criteria produced more information than he could digest.

The way to implement this varies from one project to another. You can force the UI to ask the service tier for the size of a query before launching it, or you can throw an exception from the service tier if the result set grows too large (though this couples the service tier to the limitations of one UI).
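The "ask for the size first" option can be sketched as follows. The exception type, the 200-row limit, and the SignCounter abstraction are all hypothetical; a real countMatching() would run a SELECT COUNT(*) with the same criteria as the search.

```java
// Thrown when a query matches more rows than the UI is allowed to display.
class ResultTooLargeException extends Exception {
    ResultTooLargeException(String msg) { super(msg); }
}

// Abstraction over the data source: count first, fetch only if allowed.
interface SignCounter {
    int countMatching();                     // e.g. SELECT COUNT(*) with the same criteria
    java.util.List<Integer> fetchAll();      // the actual (potentially large) query
}

class GuardedSignService {
    static final int MAX_ROWS = 200;         // illustrative display limit

    static java.util.List<Integer> search(SignCounter source) throws ResultTooLargeException {
        int n = source.countMatching();
        if (n > MAX_ROWS) {
            throw new ResultTooLargeException(
                "Query matches " + n + " rows; refine your criteria (max " + MAX_ROWS + ")");
        }
        return source.fetchAll();            // safe: the result set is small enough
    }
}
```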

Be careful! This does not mean that every method in the service tier must throw an exception if its result contains more than 100 rows. The rule applies only to result sets that are shown to the user directly, which is all the more reason to place the control in the UI rather than in the service tier.

RogueOne
+1  A: 

One thing to be wary of when working with home-grown row-wrapper classes like you (apparently) have, is code that makes additional calls to the database without you (the developer) being aware of it. For example, you might call a method that returns a collection of Person objects and think that the only thing going on under the hood is a single "SELECT * FROM PERSONS" call. In actuality, the method you're calling might iterate through the returned collection of Person objects and make additional DB calls to populate each Person's Orders collection.

As you say, one of your solutions is to not return fully-loaded objects, so you're probably aware of this potential problem. One of the reasons I tend to avoid using row wrappers is that they invariably make it difficult to tune your application and minimize the size and frequency of database traffic.
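A toy illustration of that hidden-query (N+1) problem, using a fake database that counts queries. All names here are invented for the example; the point is that touching a lazily populated child collection silently issues one extra query per parent row.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-in for the real database; queryCount tracks how many calls were made.
class FakeDb {
    int queryCount = 0;

    List<Person> loadPersons(int n) {
        queryCount++;                        // the visible "SELECT * FROM PERSONS"
        List<Person> out = new ArrayList<>();
        for (int i = 0; i < n; i++) out.add(new Person(i, this));
        return out;
    }

    List<String> loadOrders(int personId) {
        queryCount++;                        // one hidden query per person
        return List.of("order-for-" + personId);
    }
}

class Person {
    final int id;
    private final FakeDb db;
    private List<String> orders;             // lazily populated

    Person(int id, FakeDb db) { this.id = id; this.db = db; }

    List<String> getOrders() {
        if (orders == null) orders = db.loadOrders(id);   // the hidden DB call
        return orders;
    }
}
```

Loading 10 persons and reading each one's orders costs 11 queries here, not 1, which is exactly the kind of traffic that is hard to see from the calling code.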

MusiGenesis
A: 

At the data retrieval layer, the standard design pattern is to expose two method interfaces: one that returns everything, and one that returns a block at a time.

If you wish, you can layer components that do paging on top of it.