I have a large database and want to implement a feature that allows a user to do a bulk update of information: the user downloads an Excel file, makes changes, and the system accepts the changed file back.

  1. The user uses a web interface (ASP.NET) to download the data from the database to Excel.
  2. The user modifies the Excel file. Only certain data is allowed to be modified, as other columns map back into the DB.
  3. Once the user is happy with their changes they upload the changed Excel file through the ASP.NET interface.
  4. Now it's the server's job to pull the data from the Excel file (using GemBox) and validate it against the database; this is where I'm having trouble. (A reading sketch follows this list.)
  5. Validation results are shown on another ASP.NET page once validation is complete. Validation is soft, so hard failures only occur when, say, an index mapping into the DB is missing. (Missing data causes the row to be ignored, etc.)
  6. The user decides whether the proposed actions (Add, Modify, or Ignore) are appropriate; on acceptance the system applies the changes.
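
A GemBox read for step 4 might look roughly like the sketch below. The sheet layout, the column names (RecordId, SomeDate, SomeValue) and the helper class are illustrative assumptions, and the calls reflect the `ExcelFile.Load`/`Worksheets` API of recent GemBox.Spreadsheet versions rather than this exact project:

```csharp
using System;
using System.Data;
using GemBox.Spreadsheet;

static class ExcelUploadReader
{
    public static DataTable ReadUpload(string uploadedFilePath)
    {
        // License call required by GemBox; the key shown is the free/limited one.
        SpreadsheetInfo.SetLicense("FREE-LIMITED-KEY");

        // Shape the DataTable to match the staging table it will later be bulk-loaded into.
        var upload = new DataTable("Upload");
        upload.Columns.Add("RecordId", typeof(int));     // key that maps back into the DB
        upload.Columns.Add("SomeDate", typeof(string));  // keep raw text; validate later
        upload.Columns.Add("SomeValue", typeof(string));

        ExcelFile book = ExcelFile.Load(uploadedFilePath);
        ExcelWorksheet sheet = book.Worksheets[0];

        foreach (ExcelRow row in sheet.Rows)
        {
            if (row.Index == 0) continue;  // skip the header row
            upload.Rows.Add(
                row.Cells[0].Value,
                Convert.ToString(row.Cells[1].Value),
                Convert.ToString(row.Cells[2].Value));
        }
        return upload;
    }
}
```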

Before applying the changes and/or additions the user has made, the data must be validated to catch mistakes by the user (e.g., dates accidentally deleted that they never meant to remove).

It's not far-fetched for the number of rows that need updating to exceed 65k.

The question is: What is the best way to parse the data to do validation and to build up the change and addition sets?

If I load all the data that the Excel data must be validated against into memory, I might unnecessarily burden an already memory-hungry application. If I do a database hit for every tuple in the Excel file, I am looking at over 65k database hits.

Help?

A: 

To answer this properly, the following information would be useful:

  1. How are you going to notify the user of failures?
  2. Will one validation failure result in loading 64,999 records or none?
Dave Barker
1. Through an ASP.NET GUI. 2. Soft fail. 1 validation failure means 65k-1 succeed.
vanslly
+3  A: 

The approach I've seen used in the past is:

  1. Bulk-load the user's data into a 'scratch' table in the database.
  2. Validate data in the scratch table via a single stored procedure (executing a series of queries), marking rows that fail validation, require update etc.
  3. Action the marked rows as appropriate.

This works well for validating missing columns, valid key values etc. It's not so good for checking the format of individual fields (don't make SQL pull strings apart).

As we know, some folk feel uncomfortable putting business logic in the database, but this approach does limit the number of database hits your application makes, and avoids holding all the data in memory at once.
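
As a rough illustration of step 2, the set-based validation could be a handful of UPDATE statements run in one round trip. Everything here is an assumption for the sketch: the scratch table `StagingImport`, the target table `TargetTable`, the `ValidationStatus` column and the status values; in practice this logic would sit in a stored procedure rather than an inline string:

```csharp
using System.Data.SqlClient;

static class StagingValidator
{
    // Illustrative T-SQL: mark each scratch row as a hard fail, an ignore, or an update.
    const string ValidateSql = @"
        -- Hard fail: the key doesn't map into the target table.
        UPDATE s SET s.ValidationStatus = 'MissingKey'
        FROM StagingImport s
        LEFT JOIN TargetTable t ON t.RecordId = s.RecordId
        WHERE t.RecordId IS NULL;

        -- Soft fail: required data is missing, so the row will be ignored.
        UPDATE StagingImport SET ValidationStatus = 'Ignore'
        WHERE ValidationStatus IS NULL AND SomeDate IS NULL;

        -- Everything else becomes a candidate update.
        UPDATE StagingImport SET ValidationStatus = 'Update'
        WHERE ValidationStatus IS NULL;";

    public static void Validate(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(ValidateSql, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();  // one database hit validates all ~65k rows
        }
    }
}
```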

lukef
To load the data into the database would this not require ~65k INSERT statements then?
vanslly
@vanslly - most dbs have a bulk load function - in SQL Server you can use SqlBulkCopy, for example. This uses a TDS stream for the fastest possible transfer into the scratch table.
Marc Gravell
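
A minimal sketch of the bulk load Marc describes, assuming the same illustrative `StagingImport` table and DataTable columns as the other sketches; one streamed transfer replaces ~65k individual INSERTs:

```csharp
using System.Data;
using System.Data.SqlClient;

static class StagingLoader
{
    public static void BulkLoad(string connectionString, DataTable upload)
    {
        using (var bulk = new SqlBulkCopy(connectionString))
        {
            bulk.DestinationTableName = "dbo.StagingImport";  // scratch table (assumed name)
            bulk.BatchSize = 5000;                            // stream in manageable batches
            bulk.ColumnMappings.Add("RecordId", "RecordId");
            bulk.ColumnMappings.Add("SomeDate", "SomeDate");
            bulk.ColumnMappings.Add("SomeValue", "SomeValue");
            bulk.WriteToServer(upload);
        }
    }
}
```
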
+1  A: 

Your problem is very common in data warehouse systems, where bulk uploads and data cleansing are a core part of the (regular) work to be done. I suggest you google around ETL (Extract, Transform, Load) and staging tables; you'll find a wealth of good material.

In broad answer to your problem: if you do load the data into memory for checking, you're effectively re-implementing part of the DB engine in your own code. That could be a good thing if it's faster and smarter to do so. For instance, you may only have a small range of valid dates for your Excel extract, so you don't need to join to a table to check that dates are in range. However, for other data like foreign keys, let the DB do what it's good at.
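
For example, a date-range check like that needs no join at all; something along these lines (the column name and the valid window are made up for the sketch) could run over the rows before they ever reach the database:

```csharp
using System;
using System.Data;

static class QuickChecks
{
    // Assumed valid window for the extract; adjust to whatever the real data allows.
    static readonly DateTime EarliestAllowed = new DateTime(2000, 1, 1);

    public static bool DateInRange(DataRow row)
    {
        DateTime value;
        // Unparsable or out-of-range dates get flagged before the bulk load.
        return DateTime.TryParse(Convert.ToString(row["SomeDate"]), out value)
            && value >= EarliestAllowed
            && value <= DateTime.Today;
    }
}
```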

Using a staging table/database/server is a common solution as the data volumes get large. BTW, allowing users to clean data in Excel is a really good idea; allowing them to 'accidentally' remove crucial data is a really bad idea. Can you lock cells/columns to prevent this, and/or put some basic validation into Excel? If a field should be filled and should be a date, you can check that in a few lines of Excel. Your users will be happier, as they won't have to upload before finding problems.
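
If the download is generated with GemBox as well, something like the following might lock the columns users shouldn't touch before the file is served. This is only a sketch: the column layout is assumed, and the protection calls (`Protected`, `CellStyle.Locked`) should be checked against the GemBox.Spreadsheet version in use:

```csharp
using GemBox.Spreadsheet;

static class DownloadProtection
{
    public static void LockKeyColumns(ExcelWorksheet sheet)
    {
        // Leave only the editable columns unlocked, then switch protection on so
        // users can't accidentally wipe keys or dates they never meant to touch.
        sheet.Columns[0].Style.Locked = true;   // RecordId: must not change
        sheet.Columns[1].Style.Locked = false;  // editable
        sheet.Columns[2].Style.Locked = false;  // editable
        sheet.Protected = true;
    }
}
```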

MrTelly
A: 

First store the uploaded data in a temp table using a bulk upload. Then retrieve it and validate it through the interface you've built, and after validation store it in the main table or DB.
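
A sketch of that final step, carrying over the illustrative `StagingImport`/`TargetTable`/`ValidationStatus` names used in the earlier sketches, could apply all accepted rows in a single set-based statement:

```csharp
using System.Data.SqlClient;

static class ChangeApplier
{
    // Illustrative T-SQL: push every row marked 'Update' from staging into the main table.
    const string ApplySql = @"
        UPDATE t
        SET t.SomeDate  = s.SomeDate,
            t.SomeValue = s.SomeValue
        FROM TargetTable t
        JOIN StagingImport s ON s.RecordId = t.RecordId
        WHERE s.ValidationStatus = 'Update';";

    public static void ApplyChanges(string connectionString)
    {
        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(ApplySql, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}
```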