views:

272

answers:

6

How do you organize the DB layer, business logic, and cross-platform API of your information management system when uploading and processing 500,000 data records in one session is a normal operation (C# .NET 3.5 + MS SQL 2005)?

I’m specifically interested in production-proven paging patterns that behave well with respect to concurrency, scalability, and reliability.

Does anybody have any ideas, in what direction to dig?

  • Open Source Projects (don’t care about the language or platform, as long as it is not Ook)
  • books
  • articles
  • Google keywords
  • forums or newsgroups

Any help would be greatly appreciated!

Update:

  • simple paging (i.e. ROW_NUMBER in SQL 2005) does not work, since there are many concurrent changes to the database. An item that is deleted or inserted between page requests makes the current page index invalid.
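The failure mode described above can be reproduced in a few lines. This is a minimal sketch with a hypothetical one-column schema, using SQLite's LIMIT/OFFSET as a stand-in for ROW_NUMBER-style paging; the point is that position-based paging re-addresses rows after any concurrent insert or delete:

```python
# Sketch (hypothetical schema): offset-based paging breaks under concurrent
# changes because deleting an earlier row shifts every subsequent page.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, payload TEXT)")
con.executemany("INSERT INTO records (id, payload) VALUES (?, ?)",
                [(i, f"row-{i}") for i in range(1, 11)])

PAGE_SIZE = 3

def page_by_offset(page):
    # Rows are addressed by position, just like a ROW_NUMBER range query.
    rows = con.execute(
        "SELECT id FROM records ORDER BY id LIMIT ? OFFSET ?",
        (PAGE_SIZE, (page - 1) * PAGE_SIZE)).fetchall()
    return [r[0] for r in rows]

first = page_by_offset(2)                        # ids [4, 5, 6]
con.execute("DELETE FROM records WHERE id = 1")  # a concurrent edit happens
second = page_by_offset(2)                       # ids [5, 6, 7] - page shifted
print(first, second)
```

The same page number now returns different rows, which is exactly why the page index becomes invalid between requests.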
+2  A: 

This is a good book to start with:

Patterns of Enterprise Application Architecture by Martin Fowler

Jon Limjap
+2  A: 

When it comes to DB optimization for huge amounts of data, you’ll most probably benefit from the “BigTable” technique. I found the article here very useful. In short, the idea is to use DB denormalization to trade disk space for better performance.

For paging in MS SQL 2005 you’ll want to look into the ROW_NUMBER function. Here is just a simple example; you’ll find tons of them via Google (keywords: ROW_NUMBER paging SQL 2005). Do not dig too much, though – there is no magic in the implementation; the magic is in how you use/present the paging itself. Google search is a good example.
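For illustration, here is a sketch of the ROW_NUMBER paging pattern the answer refers to. It is shown against SQLite (which supports the same window function since 3.25) rather than SQL 2005, and the table/column names are made up, but the query shape is the same:

```python
# Hedged sketch of ROW_NUMBER-based paging, using SQLite as a stand-in
# for SQL 2005 (where the pattern is the same BETWEEN-over-ROW_NUMBER query).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
con.executemany("INSERT INTO items VALUES (?, ?)",
                [(i, chr(64 + i)) for i in range(1, 8)])  # A..G

def fetch_page(page, size):
    # Number every row once, then select the slice for the requested page.
    sql = """
        SELECT id, name FROM (
            SELECT id, name, ROW_NUMBER() OVER (ORDER BY id) AS rn
            FROM items
        ) WHERE rn BETWEEN ? AND ?
    """
    first = (page - 1) * size + 1
    return con.execute(sql, (first, first + size - 1)).fetchall()

print(fetch_page(2, 3))  # rows 4-6: D, E, F
```

In SQL 2005 the inner SELECT with `ROW_NUMBER() OVER (ORDER BY ...)` and the outer `WHERE rn BETWEEN @first AND @last` look essentially identical.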

Note: we found NHibernate’s native paging support insufficient for our solution.

Also, you’ll probably be interested in creating a FULLTEXT index and using full-text search. Here is an MSDN article on creating a full-text index, and some info on full-text search.

Good luck.

Dandikas
Just be aware that denormalization introduces more problems than just additional disk space usage. There's also the problem of duplicate data that needs to be kept in sync as well as other issues. Make sure that you understand the trade-offs.
Tom H.
A: 

dandikas,

thank you for mentioning the partial denormalization. Yes, that's the approach I'm considering for improving performance of some queries.

Unfortunately, the NHibernate ORM does not fit into the solution, due to the performance overhead it adds. Same with SQL paging – it does not work in a scenario with numerous concurrent edits (as revealed by stress testing).

Rinat Abdullin
A: 

I look after an enterprise data warehouse which uploads some feeds of hundreds of thousands of records.
I'm not sure if this is your scenario, but we:

  • Receive text files which we upload to a Sybase database.
  • Format the different feeds using awk so they're in a common format.
  • Load them into a denormalised intermediate table using bcp.
  • Run stored procedures to populate the normalised database structure.
  • Delete from the denormalised intermediate table.

This runs fairly well, but we force our uploads to be sequential. That is, when feeds arrive they go into a queue, and we process the feed at the head of the queue entirely before looking at the rest.
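The pipeline above can be sketched in miniature. This is a hedged analogy, not the actual system: SQLite stands in for Sybase, `executemany` for bcp, plain functions for the stored procedures, and all table/column names are invented. The structure it shows – sequential queue, denormalised staging table, normalisation step, staging cleanup – matches the steps listed:

```python
# Sketch of the staged load pipeline: feeds queue up, and each feed is
# fully processed (stage -> normalise -> clean) before the next one starts.
import sqlite3
from collections import deque

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE staging (customer TEXT, amount REAL);       -- denormalised
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE orders (customer_id INTEGER, amount REAL);  -- normalised
""")

def process_feed(rows):
    # 1. Bulk-load the common-format rows into staging (bcp stand-in).
    con.executemany("INSERT INTO staging VALUES (?, ?)", rows)
    # 2. Populate the normalised structure (stored-procedure stand-in).
    con.execute("INSERT OR IGNORE INTO customers (name) "
                "SELECT DISTINCT customer FROM staging")
    con.execute("INSERT INTO orders "
                "SELECT c.id, s.amount FROM staging s "
                "JOIN customers c ON c.name = s.customer")
    # 3. Delete from the denormalised intermediate table.
    con.execute("DELETE FROM staging")
    con.commit()

queue = deque([[("acme", 10.0), ("acme", 5.0)], [("globex", 7.5)]])
while queue:                       # strictly sequential: one feed at a time
    process_feed(queue.popleft())

print(con.execute("SELECT COUNT(*) FROM orders").fetchone()[0])  # 3
```

Keeping the loads strictly sequential is what lets each feed see a quiet staging table, at the cost of throughput.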

Is any of that helpful?

AJ
A: 

Same with the SQL paging - it does not work in the scenario of numerous concurrent edits (as detected by the stress-testing)

As I mentioned, there is no magic in implementing paging – you either use ROW_NUMBER or a temporary table. The magic is in evaluating your most common real-world usage scenario. Using a temporary table along with user tracking might help somewhat in overcoming the concurrent-edits scenario. Though I sense you’ll win more by answering these questions:

  1. How long does a user stay on one page before moving to another?
  2. How often does a user move from the first page to any other page?
  3. How many pages will a user typically look through?
  4. How critical is it if some information changes while the user is moving from one page to another and back?
  5. How critical is it if some information gets deleted while the user is on the page that shows it?

Try not to concentrate on a question like “How do I handle every possible concurrent-edit scenario while paging?” Answer the questions above first, then handle only the situations that really matter.
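The “temporary table along with user tracking” idea can be sketched as follows. This is an illustrative sketch with invented names, again using SQLite in place of SQL 2005: the matching row ids are captured once into a per-session snapshot table, and all subsequent page requests read from the snapshot, so concurrent inserts and deletes cannot shift the pages that session is browsing:

```python
# Sketch of snapshot-based paging: number the matching ids once per session,
# then page over the frozen snapshot instead of the live table.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE records (id INTEGER PRIMARY KEY)")
con.execute("CREATE TABLE snapshot "
            "(session TEXT, rn INTEGER, record_id INTEGER)")
con.executemany("INSERT INTO records VALUES (?)", [(i,) for i in range(1, 11)])

def open_session(session):
    # User tracking: freeze the current result set under a session key.
    con.execute("""INSERT INTO snapshot
                   SELECT ?, ROW_NUMBER() OVER (ORDER BY id), id
                   FROM records""", (session,))

def fetch_page(session, page, size):
    first = (page - 1) * size + 1
    rows = con.execute("""SELECT record_id FROM snapshot
                          WHERE session = ? AND rn BETWEEN ? AND ?""",
                       (session, first, first + size - 1)).fetchall()
    return [r[0] for r in rows]

open_session("user-42")
before = fetch_page("user-42", 2, 3)             # [4, 5, 6]
con.execute("DELETE FROM records WHERE id = 1")  # concurrent edit elsewhere
after = fetch_page("user-42", 2, 3)              # still [4, 5, 6]
print(before, after)
```

The trade-off is the one raised in the questions above: the user sees a stable but possibly stale view, and the snapshot must eventually be expired and refreshed.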

Another note is the UI. Check out as many paging UIs as you can find; there are much better solutions than just left/right arrows or lined-up page numbers, and some of them help hide or work around paging scenarios that cannot be solved technically.

P.S. If this answer is useful I’ll combine it with my first one.

Dandikas
Thank you for the extensive comment. Yet, it is different. I'm talking about the cross-platform API in the post, not the UI. Imagine a situation where 500,000 records are uploaded/deleted within 5-10 minutes by one customer, while the same records are being paged by the automation service concurrently.
Rinat Abdullin
+1  A: 

Done the implementation. I have been informed recently that one of the uploads was about 2148849 records. The tiers successfully dealt with a couple of broken connections and dozens of deadlocks at the DB level during this upload.

In case somebody else needs some info:

Rinat Abdullin