views:

73

answers:

3

I want to distribute a large amount of data to different C# applications. For example, my table contains millions of records. I would like to specify that first 3 millions records are processed by App1 and next 3 million in another C# application App2 and so on. Table rows are deleted and added as per requirement. Now I want to write a SQL query that will process first 3 million records. Now if 5 records are deleted from app1 then app1 must fetch next 5 records from app2 and app2 from app3. So that data always remain constant in each app.

I have used limit in the SQL query, but I didn't get the required output. How can I write the SQL query for this and how should I design the C# application.

+1  A: 

That sounds like a really bad idea. Requesting a limit of 3 million records is a very slow operation.

An alternative approach would be to have an instance number column and have each instance of your application reserve rows as it needs them by writing its instance number into this column. Process your data in smaller chunks if possible.

Adding an index to the instance number column will allow you to count how many rows you have already handled and also to find the next batch of 1000 (for example) that haven't been assigned to any instance yet.

Mark Byers
+4  A: 

It looks a bit as if you want to take over from the database and do the processing that a database is tasked and tailored to do, in your own application. You talk of an SQL query with a LIMIT statement. Don't use that. Millions of records is not much in database terms. If you have performance issues, you may need to index your table or revisit the query design (watch its execution path for performance issues).

If you really cannot let the database do the task and you need to process them one by one in your application, the network latency and bandwidth is likely to be an earlier candidate for performance issues, which you won't make any faster by using multiple apps (let alone the cost of such queries).

If my observations are wrong and your processing of the records must take place outside the database and the network is not a bottleneck, nor are the processors or the database machine and multiple applications will provide a performance gain, then I suggest you create a dispatch application that processes the records and makes them available to your other applications (or better: threads) through normal POCOs. This creates a much easier way of spreading the processing and the dispatch application (or thread) can work as some kind of funnel for your processing applications.

However, look at the cost / benefit equation: is the trouble really going to gain you some performance, or is it better to revisit your design and find a more practical solution?

Abel
A: 

I would beneift from more understanding of the details of the application and the process to get, select, delete, etc. However, to give it a shot to a viable answer.

In short, use partitioned tables and distributed views. Each application is "keyed" to those tables through the common partitioned view, if any application has to act on another table (or "key") it can use the same view and act on the other tables.

In more detail ...

If you have the Enterprise or Developer edition of SQL Server, or any other that provides distributed views, then you can create three or more tables with a partitioned column ("App1", "App2", "App3"), like what Mark Byers has said, that would then distribute the ability to process against data evenly.

Now create a view (WITH SCHEMABINDING) for "Select Field1, Field2, Field3, etc. from table1 UNION Select Field1, Field2, Field3, etc. from table2 UNION Select Field1, Field2, Field3, etc. from table3"

Create a unique clustered key on the one/two fields that uniquly represent your data. When this is done, you can now select/delete/update from view where partitioncolumn = "app1" and "id=?". This routine makes it that the activity on the view is allowed for action queries (insert/update/delete) and only acts on the table of the partitioned data.

So, App1, sends an "App1" WHERE filter, and the db engine only acts on table1 even though the view has.

SnapJag