views:

136

answers:

4

Input : Pairs of From->To Rows.

From  To
 1     2
 2     3
 3     4
 6     7

Output: For Each From Value, pairs of reachable To values. E.g. for 1

Source Reachable
 1      2
 1      3
 1      4

Obviously, one can suck out the data to a Graph structure and run DFS scan.

Is there a alternative way to do so, such that:

  1. Uses SQL/Functional Style instead of imperative programming?
  2. Fast enough for 10 million rows. (Current graph approach in C#/SSIS runs in ~2 hrs)
+2  A: 

Using a CTE (Common Table Expressions) recursively sounds like the right answer here. Have a look here for a similar situation involving date ranges.

John Fisher
looks like rCTE does the work, will check back on real data tomorrow and update thread
DotDot
CTEs work? That means you're using SQL Server 2005 at least. 2008 has better hierarchical syntax...
OMG Ponies
one problem - CTE gets stuck in infinite loop that a DFS will not get into, due to color changes. Can we achive that?
DotDot
Why don't you post the query to a different question and see if people can help fix it up?
John Fisher
+1  A: 

How about this:

First run: make hashes.

h[1] = 2
h[2] = 3
h[3] = 4
h[6] = 7

Second run: for each key, see if it is unprocessed (i will explain), if yes then do a change run and output the reachability:

h[1] = 2 (unprocessed) --> output "1 2"
  h[2] = 3 (unprocessed) --> output "1 3"
    h[3] = 4 (unprocessed) --> output "1 4"
      h[4] = null

Now we store the computed (processed) results to speedup future lookups (as in dynamic programming):

h[1] = 2,3,4,
h[2] = 3,4,
h[3] = 4,

And so on.

Extreme case scenarios:

  1. No values are used as keys. In the second run we will have two lookups per key.
  2. It is a single chain. Then in second run, after h[1] is evaluated, rest is just picking up the computed values.

Not sure about the actual speed of execution, needs testing.

Joy Dutta
The database is always going to be faster at this than in a 3rd gen language.
OMG Ponies
SCANs are expensive business when it comes to performance.
Faiz
A: 

A DBMS is designed for processing relational information/record sets and not for DFS like hieararchical approach. When it comes to processing hieararchical information and you need performance, it is always better to get the job out and done by an external code written in some 3rd generation language. Is it okay to use manged (CLR) SQL functions or a script task inside SSIS for your particular requirement?

Faiz
A: 

You should combine:

  • Batch processing
  • Functional programming style
  • Clustering (Shared nothing => Map Reduce)
Martin K.