views:

988

answers:

5

Consider the following simple DAG:

  1->2->3->4

And a table, #bar, describing this (I'm using SQL Server 2005):

parent_id   child_id
1           2
2           3
3           4
//... other edges, not connected to the subgraph above

Now imagine that I have some other arbitrary criteria that select the first and last edges, i.e. 1->2 and 3->4. I want to use these to find the rest of my graph.

I can write a recursive CTE as follows (I'm using terminology from MSDN):

with foo(parent_id,child_id) as (
// anchor member that happens to select first and last edges:
select parent_id,child_id from #bar where parent_id in (1,3)
union all
// recursive member:
select #bar.* from #bar
join foo on #bar.parent_id = foo.child_id
)
select parent_id,child_id from foo

However, this results in edge 3->4 being selected twice:

parent_id  child_id
1          2
3          4
2          3
3          4    // 2nd appearance!

How can I prevent the query from recursing into subgraphs that have already been described? I could achieve this if, in my "recursive member" part of the query, I could reference all data that has been retrieved by the recursive CTE so far (and supply a predicate indicating in the recursive member excluding nodes already visited). However, I think I can access data that was returned by the last iteration of the recursive member only.

This will not scale well when there is a lot of such repetition. Is there a way of preventing this unnecessary additional recursion?

Note that I could use "select distinct" in the last line of my statement to achieve the desired results, but this seems to be applied after all the (repeated) recursion is done, so I don't think this is an ideal solution.

Edit - hainstech suggests stopping recursion by adding a predicate to exclude recursing down paths that were explicitly in the starting set, i.e. recurse only where foo.child_id not in (1,3). That works for the case above only because it simple - all the repeated sections begin within the anchor set of nodes. It doesn't solve the general case where they may not be. e.g., consider adding edges 1->4 and 4->5 to the above set. Edge 4->5 will be captured twice, even with the suggested predicate. :(

+1  A: 

The CTE's are recursive.

When your CTE's have multiple initial conditions, that means they also have different recursion stacks, and there is no way to use information from one stack in another stack.

In your example, the recursion stacks will go as follows:

(1) - first IN condition
(1, 2)
(1, 2, 3)
(1, 2, 3, 4)
(1, 2, 3) - no more children
(1, 2) - no more children
(1) - no more children, going to second IN condition

(3) - second condition
(3, 4)
(3) - no more children, returning

As you can see, these recursion stack do not intersect.

You could probably record the visited values in a temporary table, JOIN each value with the temptable and do not follow this value it if it's found, but SQL Server does not support these things.

So you just use SELECT DISTINCT.

Quassnoi
I only have one initial condition but it can select multiple rows. Do you have a link to any documentation that states that each row is treated with independent stacks? (I presume I could emulate the desired effect using my own temp tables - not tried yet)I don't have enough 'reputation' to vote up your answer >:(
bacar
aha, you revised your answer to suggest temp table while I was commenting :)
bacar
+1  A: 

Do you happen to know which of the two edges is on a deeper level in the tree? Because in that case, you could make edge 3->4 the anchor member and start walking up the tree until you find edge 1->2.

Something like this:

with foo(parent_id, child_id)
as
(
    select parent_id, child_id
    from #bar
    where parent_id = 3

    union all

    select parent_id, child_id
    from #bar b
    inner join foo f on b.child_id = f.parent_id
    where b.parent_id <> 1
)
select *
from foo
Ronald Wildenberg
I'm afraid I don't - I generally don't even know whether the edges selected in the initial query might belong to the same overall graph, or not :(Thanks though!
bacar
A: 

(I'm no expert on graphs, just exploring a bit)

The DISTINCT will guarantee each row is distinct, but it won't eliminate graph routes that don't end up in your last edge. Take this graph:

insert into #bar (parent_id,child_id) values (1,2)
insert into #bar (parent_id,child_id) values (1,5)
insert into #bar (parent_id,child_id) values (2,3)
insert into #bar (parent_id,child_id) values (2,6)
insert into #bar (parent_id,child_id) values (6,4)

The results of the query here include (1,5), which is not part of the route from the first edge (1,2) to the last edge (6,4).

You could try something like this, to find only routes that start with (1,2) and end with (6,4):

with foo(parent_id, child_id, route) as (
    select parent_id, child_id, 
        cast(cast(parent_id as varchar) + 
        cast(child_id as varchar) as varchar(128))
    from #bar
    union all
    select #bar.parent_id, #bar.child_id, 
        cast(route + cast(#bar.child_id as varchar) as varchar(128)) 
    from #bar
    join foo on #bar.parent_id = foo.child_id
)
select * from foo where route like '12%64'
Andomar
I don't want to eliminate routes that don't end up in my 'last' edge (I want everything that connects to any of the edges specified in the anchor) - but I agree that this is an interesting (separate) problem :)
bacar
If you add (6,1), would that count as being connected?
Andomar
Yes, but then it gets cyclic, and you have a whole set of other problems to worry about :)
bacar
(Note that an ability to track nodes you've already visited would actually help in not infinitely looping through cyclic graphs, too!)
bacar
Heh, I meant (6,1) to your original example! Or say, (7,1) to mine. Or is it given that (1,2) is an edge node and can't have entries before it?
Andomar
Oh, right! Sorry. Then no, I don't want to select that. Sorry, I've been rather lax with my terminology - I only want to select the subgraphs that are "downstream" of the initial set of nodes.
bacar
Thanks, I guess that's implicitly clear when people talk about graphs (I oughta read a book about them some day!)
Andomar
+1  A: 

Is this what you want to do?

create table #bar (parent_id int, child_id int)
insert #bar values (1,2)
insert #bar values (2,3)
insert #bar values (3,4)

declare @start_node table (parent_id int)
insert @start_node values (1)
insert @start_node values (3)

;with foo(parent_id,child_id) as (
    select
        parent_id
        ,child_id
    from #bar where parent_id in (select parent_id from @start_node)

    union all

    select
        #bar.*
    from #bar
        join foo on #bar.parent_id = foo.child_id
    where foo.child_id not in (select parent_id from @start_node)
)
select parent_id,child_id from foo

Edit - @bacar - I don't think this is the temp table solution Quasnoi was proposing. I believe they were suggesting basically duplicate the entire recursion member contents during each recursion, and use that as a join to prevent reprocessing (and that this is not supported in ss2k5). My approach is supported, and the only change to your original is in the predicate in the recursion member to exclude recursing down paths that were explicitly in your starting set. I only added the table variable so that you would define the starting parent_ids in one location, you could just as easily have used this predicate with your original query:

where foo.child_id not in (1,3)
Yes - this is an implementation of the "temp table" solution suggested by QuassnoiThanks!
bacar
@bacar - please see my edit to the post
OK, I see now - that works for this simple case where all the repeated parts are contained within the initial set - it doesn't prevent repeatedly recursing into _other_ subgraphs that are shared by, but not included directly within the initial set.e.g. consider graph:insert #bar values (1,2)insert #bar values (2,3)insert #bar values (1,3)insert #bar values (3,4)And initial start node of 1 only.Your query would still pick out edge 3->4 twice, because I can't insert the nodes I've visited into the table variable in the recursive member part (as you mention).:(
bacar
hrrmph, apologies for formatting, didn't realise it's v limited in comments...
bacar
@bacar - thanks for the helpful example, you may want to edit your question to incorporate that. You can't save any kind of state with the CTE, so I think you may have to go with an iterative tsql approach or something with the CLR.
Done - have tried to keep it as simple as possible though. Thanks!
bacar
A: 

This is the approach I used. It has been tested against several methods and was the most performant. It combines the temp table idea suggested by Quassnoi and the use of both distinct and a left join to eliminate redundant paths to the recursion. The level of the recursion is also included.

I left the failed CTE approach in the code so you could compare results.

If someone has a better idea, I'd love to know it.

create table #bar (unique_id int identity(10,10), parent_id int, child_id int)
insert #bar  (parent_id, child_id)
SELECT 1,2 UNION ALL
SELECT 2,3 UNION ALL
SELECT 3,4 UNION ALL
SELECT 2,5 UNION ALL
SELECT 2,5 UNION ALL
SELECT 5,6

SET NOCOUNT ON

;with foo(unique_id, parent_id,child_id, ord, lvl) as (
    -- anchor member that happens to select first and last edges:
    select unique_id, parent_id, child_id, row_number() over(order by unique_id), 0
    from #bar where parent_id in (1,3)
union all
-- recursive member:
select b.unique_id, b.parent_id, b.child_id, row_number() over(order by b.unique_id), foo.lvl+1
    from #bar b
    join foo on b.parent_id = foo.child_id
)
select unique_id, parent_id,child_id, ord, lvl from foo

/***********************************
    Manual Recursion
***********************************/
Declare @lvl as int
Declare @rows as int
DECLARE @foo as Table(
    unique_id int,
    parent_id int,
    child_id int,
    ord int,
    lvl int)

--Get anchor condition
INSERT @foo (unique_id, parent_id, child_id, ord, lvl)
select unique_id, parent_id, child_id, row_number() over(order by unique_id), 0
    from #bar where parent_id in (1,3)

set @rows=@@ROWCOUNT
set @lvl=0

--Do recursion
WHILE @rows > 0
BEGIN
    set @lvl = @lvl + 1

    INSERT @foo (unique_id, parent_id, child_id, ord, lvl)
    SELECT DISTINCT b.unique_id, b.parent_id, b.child_id, row_number() over(order by b.unique_id), @lvl
    FROM #bar b
     inner join @foo f on b.parent_id = f.child_id
     --might be multiple paths to this recursion so eliminate duplicates
     left join @foo dup on dup.unique_id = b.unique_id
    WHERE f.lvl = @lvl-1 and dup.child_id is null

    set @rows=@@ROWCOUNT 
END

SELECT * from @foo

DROP TABLE #bar
Laramie