views:

120

answers:

4

Hey. consider d following table and data...

in_timestamp | out_timestamp | name  | in_id | out_id | in_server | out_server | status
timestamp1   | timestamp2    | data1 |id1   | id2    | others-server1   | my-server1 | success
timestamp2   | timestamp3    | data1 | id2   | id3    | my-server1   | my-server2 | success
timestamp3   | timestamp4    | data1 | id3   | id4    | my-server2   | my-server3 | success
timestamp4   | timestamp5    | data1 | id4   | id5    | my-server3   | others-server2 | success
  • the above data represent log of a execution flow of some data across servers.
  • e.g. some data has flowed from some 'outside-server1' to bunch of 'my-servers' and finally to destined 'others-server2'.

Question :

1) I need to give this log in representable form to client where he doesn't need to know anything about the bunch of 'my-servers'. All i am supposed to give is timestamp of the data entered my infrastructure and when it left; drilling down to following info.

in_timestamp (of 'others_server1' to 'my-server1')
out_timestamp (of 'my-server3' to 'others-server2')
name 
status

I want to write sql for the same! Can someone help? NOTE : there might not be 3 'my-servers' all the time. It differs from situation to situation. e.g. there might be 4 'my-server' involved for, say, data2!

2) Are there any other alternatives to SQL? I mean stored procs/etc?

3) Optimizations? (The records are huge in number! As of now, it is around 5 million a day. And we are supposed to show records that are upto a week old.)

In advance, THANKS FOR THE HELP! :)

A: 

One way of doing it - if the data is STABLE (e.g. never changes onc inserted) is to compute the transitive relationships ON THE FLY (e.g. via a trigger or by the app which does the insertion) at the insert time.

E.g. you have a new column "start_ts" in your table; when you insert a record:

in_timestamp | out_timestamp | name  | in_id | out_id | in_server | out_server | status
timestamp3   | timestamp4    | data1 | id3   | id4    | my-server2   | my-server3 | success

... then your logic automatically finds the record with name=data1 and out_id=id3 and clones its start_ts into the newly inserted record. You may need some special logic around propagating last status as well depending on how you compute those transitive values.

BTW, you need not necessarily have to look up the previous (name=data1 and out_id=id3) record - you can persist the start_ts value in the data record's meta data itself while processing.

Then the final report is simply select start_ts, out_ts from T where out_server=others_server2 (of course more complicated as far as out_server and status, but still a single simple select)

A second option is of course the more straightforward loop computing the resulting report - google or "stack" (is that an accepted verb now?) for SQL BFS implementations if you're not sure how.

DVK
FIRST OPTION : The problem of generating in memory calculation is that each of those logs are generated @ different 'my-servers' but each of those 'my-servers' persist in centralized DB. So i can imagine real-time distributed computing but i am not that expertise to try it out... DB thingy sounds a bit possible to me.. :D SECOND OPTION : Again, i want to avoid doing this @ app. level reason being there is a huge chunk of data in DB. I want to avoid loading all those to application and then processing. :)Correct me if i have misunderstood u or i misunderstood ur comments! :)
MiKu
+1  A: 

Your table has a heirarchical structure (adjacency lists). This can be queried efficiently in PostgreSQL v8.4 and later using recursive CTEs. Quassnoi has written a blog post about how to implement it. It is a quite complex query that you need to write but he explains it well with examples very similar to what you need. Especially if you look at his last example, he demonstrates a query than gets the complete path from the first node to the last by using an array.

Mark Byers
A: 
  • @Other Readers :

    Refer to 1st answer posted by Mark Byers first. I used 'answering' rather than 'commenting' his post since i needed to use tables/links etc. which is not available while commenting on answers. :)

  • @Mark Byers :

Thanks for the link... It really helped me and i was able to figure out the way to generate the path between the servers... Have a look @ what i was able to do.

in_id   | in_timestamp  | out_timestmap | name  | hops_count    | path  |
id1     | timestamp1    | timestamp2    | data1 | 1             | {id1} |
id2     | timestamp2    | timestamp3    | data1 | 2             | {id1,id2} |
id3     | timestamp3    | timestamp4    | data1 | 3             | {id1,id2,id3} |
id4     | timestamp4    | timestamp2    | data1 | 4             | {id1,id2,id3,id4} |

* path is generated using 'in_id'

I used the following query...

WITH RECURSIVE foo AS
        (
        SELECT  t_alias1, 1 AS hops_count, ARRAY[in_id] AS hops
        FROM    log_parsing.log_of_sent_mails t_alias1
        WHERE   in_server = 'other-server1'
        UNION ALL
        SELECT  t_alias2, foo.hops_count + 1 AS hops_count, hops || in_id
        FROM    foo
        JOIN    log_parsing.log_of_sent_mails t_alias2
        ON      t_alias2.in_id = (foo.t_alias1).out_id 
        )
SELECT  (foo.t_alias1).in_id,
        (foo.t_alias1).name,
        (foo.t_alias1).in_timestamp,
        hops_count,
        hops::VARCHAR AS path
FROM    foo   ORDER BY   hops

But i could not reach the ultimate stage yet. Here is what i wish to get ultimately...

in_id   | in_timestamp  | out_timestmap | name  | hops_count    | path  |
id4     | timestamp1    | timestamp5    | data1 | 4             | {id1,id2,id3,id4}|

* observe the timestamp. Its required since i do not wish the client to know about the internal infrastructure. So for him the time-lag between timestamp1 and timestamp5 is what matters.

Any clue how possibly i could achieve it!?

p.s. I would try contacting Quassnoi too. :)

MiKu
+2  A: 
WITH RECURSIVE foo AS
        (
        SELECT  *, in_timestamp AS timestamp1, 1 AS hop, ARRAY[in_id] AS hops
        FROM    log_parsing.log_of_sent_mails
        WHERE   in_server = 'other-server1'
        UNION ALL
        SELECT  t_alias2.*, foo.timestamp1, foo.hop + 1, hops || in_id
        FROM    foo
        JOIN    log_parsing.log_of_sent_mails t_alias2
        ON      t_alias2.in_id = (foo.t_alias1).out_id 
        )
SELECT  *
FROM    foo
ORDER BY
        hop DESC
LIMIT 1
Quassnoi