+1  Q: 

Directed graph SQL

I have the following data set, which represents nodes in a directed graph.

CREATE TABLE nodes (NODE_FROM VARCHAR2(10),
                    NODE_TO VARCHAR2(10));

INSERT INTO nodes VALUES('GT','TG');
INSERT INTO nodes VALUES('GG','GC');
INSERT INTO nodes VALUES('AT','TG');
INSERT INTO nodes VALUES('TG','GC');
INSERT INTO nodes VALUES('GC','CG');
INSERT INTO nodes VALUES('TG','GG');
INSERT INTO nodes VALUES('GC','CA');
INSERT INTO nodes VALUES('CG','GT');

Visual representation: http://esser.hopto.org/temp/image1.JPG

Using this data set, I want a user to enter a level (e.g. 2) and have the query return all nodes 2 "hops" away from a specific node:

NODE_FROM  NODE_TO

TG         GC
TG         GG
AT         TG
GT         TG

http://esser.hopto.org/temp/image2.JPG

My current attempt looks like this:

SELECT node_from, node_to
  FROM nodes
  WHERE level <= 2   -- Display nodes two "hops" from 'AT'
START WITH node_from = 'AT'
CONNECT BY NOCYCLE PRIOR node_to = node_from
    OR    node_to = PRIOR node_from
GROUP BY node_from, node_to;

http://esser.hopto.org/temp/image3.JPG

As you can see, the relationship: GT -> TG is missing.

A: 

Sounds like you need to get a copy of Joe Celko's Trees and Hierarchies in SQL for Smarties.

Tony
+2  A: 

So your graph looks like this:

[image]

You can use Oracle's START WITH/CONNECT BY feature to do what you want. If we start at node GA, we can reach all nodes in the graph, as shown below.

CREATE TABLE edges (PARENT VARCHAR(100), CHILD VARCHAR(100));

insert into edges values ('AT', 'TG');
insert into edges values ('CG', 'GT');
insert into edges values ('GA', 'AT');
insert into edges values ('GC', 'CA');
insert into edges values ('GC', 'CG');
insert into edges values ('GG', 'GC');
insert into edges values ('GT', 'TG');
insert into edges values ('TG', 'GA');
insert into edges values ('TG', 'GC');
insert into edges values ('TG', 'GG');
COMMIT;

SELECT *
  FROM edges
START WITH CHILD = 'GA'
CONNECT BY NOCYCLE PRIOR CHILD = PARENT;

Output:

    PARENT  CHILD
1   TG      GA
2   GA      AT
3   AT      TG
4   TG      GC
5   GC      CA
6   GC      CG
7   CG      GT
8   CG      GT
9   GC      CA

NOTE: Since your graph has cycles, it's important to use the NOCYCLE syntax on the CONNECT BY; otherwise Oracle raises ORA-01436 (CONNECT BY loop in user data) instead of returning rows.

EDITED ANSWER BASED ON LATEST EDITS BY OP

First of all, I assume that by "2 hops" you mean "at most 2 hops", because your current query is using level <= 2. If you want exactly 2 hops, it should be level = 2.

In your updated graph (image2.JPG), there is no path from AT to GT that takes 2 hops, so the query is returning what I would expect. From AT to GT, we can go AT->TG->GC->CG->GT, but that's 4 hops, which is greater than 2, so that's why you aren't getting that result back.
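To double-check the hop counts, here is a minimal Python sketch (not Oracle SQL) that does a breadth-first walk over the edges from the question, following them in the forward direction only. The function name `hop_counts` is just an illustrative helper, not anything Oracle provides.

```python
from collections import deque

# Edges copied from the question's INSERT statements: (node_from, node_to).
EDGES = [("GT", "TG"), ("GG", "GC"), ("AT", "TG"), ("TG", "GC"),
         ("GC", "CG"), ("TG", "GG"), ("GC", "CA"), ("CG", "GT")]

def hop_counts(start, edges):
    """Return the minimum number of forward hops from start to each reachable node."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, []).append(dst)
    hops = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, []):
            if nxt not in hops:  # first visit is the shortest route in a BFS
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    return hops

hops_from_at = hop_counts("AT", EDGES)
print(hops_from_at["GT"])  # 4, via AT->TG->GC->CG->GT
```

Adding the extra edge with `EDGES + [("TG", "GT")]` brings GT down to 2 hops from AT.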

If you are expecting to be able to reach AT to GT in 2 hops, then you need to add an edge between TG and GT, like this:

INSERT INTO nodes VALUES('TG','GT');

Now when you run your query, you'll get this data back:

NODE_FROM  NODE_TO

AT         TG
TG         GC
TG         GG
TG         GT

Remember that START WITH/CONNECT BY only works if there is a path between the nodes. In your graph (before I added the new edge above), there is no path AT->TG->GT, so that's why you're not getting the result back.

Now, if you added the edge TG->AT, then we would have the path GT->TG->AT. So in that case AT is 2 hops away from GT (i.e. we're going the reverse way now, starting from GT and ending at AT). But to find those paths, you would need to set START WITH node_from = 'GT'.

If your goal is to find all paths from a start node to any target node that is at most 2 hops away (level <= 2), then the above should work.

However, if you want to find all paths from some target node back to a source node (i.e. the reverse example I gave, GT->TG->AT), then that's not going to work here. You'd have to run the query for all nodes in the graph.

Think of START WITH/CONNECT BY as doing a depth first search. It's going to go everywhere it can from a starting node. But it's not going to do any more than that.
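As a rough illustration of that depth-first behaviour, the Python sketch below mimics a query with START WITH node_from = 'AT', CONNECT BY NOCYCLE PRIOR node_to = node_from and level <= 2. It is an approximation under stated assumptions: it models only the forward condition (not the OR clause in the question's query), and it stands in for NOCYCLE by refusing to reuse an edge already on the current path. `edges_within` is a made-up helper name.

```python
# Edges copied from the question's INSERT statements: (node_from, node_to).
EDGES = [("GT", "TG"), ("GG", "GC"), ("AT", "TG"), ("TG", "GC"),
         ("GC", "CG"), ("TG", "GG"), ("GC", "CA"), ("CG", "GT")]

def edges_within(start, max_level, edges):
    """Collect edges reachable from start in at most max_level forward hops."""
    found = set()

    def visit(edge, level, path):
        found.add(edge)
        if level == max_level:
            return
        for nxt in edges:
            # Follow an edge whose node_from equals the current edge's node_to,
            # skipping edges already on the current path (rough NOCYCLE stand-in).
            if nxt[0] == edge[1] and nxt not in path:
                visit(nxt, level + 1, path | {nxt})

    for edge in edges:
        if edge[0] == start:  # plays the role of START WITH node_from = start
            visit(edge, 1, {edge})
    return found

result = edges_within("AT", 2, EDGES)
print(sorted(result))  # [('AT', 'TG'), ('TG', 'GC'), ('TG', 'GG')]
```

GT->TG never shows up because nothing reachable from AT within 2 forward hops leads into GT.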

Summary:

I think the query works fine, given the constraints above. I've explained why the GT-TG path is not returned, so I hope that makes sense.

Keep in mind, however, if you are trying to traverse reverse paths as well, you'll have to loop over every node and run the query, changing the START WITH node each time.
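The "loop over every node" idea can be sketched in Python like this. It assumes the hypothetical TG->AT edge discussed above has been added, tries each node in turn as the start, and reports which ones reach the target within the hop limit. `starts_reaching` is an illustrative name, and in Oracle you would re-run the query with a different START WITH value instead.

```python
# Edges from the question plus the hypothetical TG->AT edge discussed above.
EDGES = [("GT", "TG"), ("GG", "GC"), ("AT", "TG"), ("TG", "GC"),
         ("GC", "CG"), ("TG", "GG"), ("GC", "CA"), ("CG", "GT"),
         ("TG", "AT")]

def starts_reaching(target, max_hops, edges):
    """Return the start nodes (other than target) that reach target in <= max_hops."""
    successors = {}
    for src, dst in edges:
        successors.setdefault(src, set()).add(dst)
    nodes = sorted({node for edge in edges for node in edge})
    reaching = []
    for start in nodes:  # one "query" per candidate start node
        if start == target:
            continue
        frontier = {start}
        for _ in range(max_hops):
            # Advance one forward hop from every node in the frontier.
            frontier = {dst for node in frontier for dst in successors.get(node, ())}
            if target in frontier:
                reaching.append(start)
                break
    return reaching

sources = starts_reaching("AT", 2, EDGES)
print(sources)  # ['GT', 'TG']
```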

dcp
@idea_ You're welcome. I have edited my answer based on your latest example. Let me know if you still have questions or need clarifications.
dcp