ansaurus

Question

Refining data stored in SQLite - how to join several contacts?

Answer 1

+2 A:

There's one difficulty with your explanation.

What you start with is a directed graph where each edge represents a connection X=>Y where X is a donor and Y an acceptor. The table atoms is the SQL representation of that graph.

What you seem to want is something that is undirected. So that a link X-Y means that X and Y are linked via a water molecule (or some other species of course) but that X and Y could both be donors or acceptors. For this reason your last table has an ambiguity (that you note) so that some links could occur either way around. It seems to me that means that the column headings donor_id and acceptor_id on your final column don't have any meaning that you have explained. This may be my confusion of course.

If all you want is a table with all 6 links in it, each as one row, but don't worry too much about keeping track of the donor/acceptor thing then this works for me in sqlite3:

 create temporary view hoh_view as 
 select donor_id as id, atoms.id as hoh_id from contacts, atoms 
       where acceptor_id=atoms.id and atoms.amino='HOH' 
 union select acceptor_id as id, atoms.id as hoh_id from contacts, atoms 
       where donor_id=atoms.id and atoms.amino='HOH';

 select a.id, b.id from hoh_view as a, hoh_view as b 
       where a.id > b.id and a.hoh_id=b.hoh_id;

Where I have used a temporary view to make things clearer. You can put this all into one big query if you like by replacing each reference to hoh_view by the first query. It feels a bit nasty to me and there may be a way of tidying it up.

If you do want to keep track of donor/acceptor relationships you need to explain how you decide what to do when both amino acids are acceptors or donors (i.e. the last two rows in your example).

If that doesn't do what you want, then maybe I can fix it up so it does.

Francis Davey 2010-05-25 19:42:17

SQL isn't very good at representing graphs because its relational.

Francis Davey 2010-05-25 19:43:10

Thank you. The original connections are directed. Many of my new connections can be directed too (A donor, D acceptor). How would you recommend to represent undirected data? Because I could do without the information about the direction of the edge.

Krab 2010-05-25 20:10:03

Unfortunately, your code doesn't work as I need. It doesn't care about the fact that the water must be the same - it would create connections between any atoms connected to any water in my db which satisfy a.id > b.id. :-( Would grouping the last select help somehow? There is also an alternative - if it's easier, the task could be solved by removing one of the undirected connections (C-D, D-C). But again, I have no idea how to do it reliably. Thank you for your effort

Krab 2010-05-25 20:29:54

@Krab: the original data lacks any grouping information, and Fransis suggested a solution that does not use any grouping. If the information is present, both `WHERE` clauses will include an additional condition. `a.id > b.id` introduce a directed connection based on id value and eliminating duplicates like (C-D, D-C).

newtover 2010-05-25 20:44:51

@newtower: The original data contain the grouping information implicitly. Imagine you have side by side twice the same problem as depicted - having original atoms as well as O1, A1, B1, C1, D1. Now, the refined db should contain only 12 connections, 6 for each water, not connections like A1-C. Or maybe I just didn't understand how would you make an additional condition.

Krab 2010-05-25 20:54:36

@Krab - you are quite right. What I originally had in mind was a top level select on HOH elements and then using that to direct the later searches. Let me see if I can think of something.

Francis Davey 2010-05-25 21:29:29

I've made a correction to permit only the same HOH, which I hope works. If you have some data in insert form that would be great.

Francis Davey 2010-05-25 21:49:43

@Donnie: I got the database by parsing a text file having one contact per row. I saved it to a db separating info about contacts and atoms into two tables (as atoms can occur in multiple contacts). How would you recommend to store this data better?

Krab 2010-05-25 21:54:27

@Francis Davey: This seems to work like I need. I will test it further tomorrow, but it looks all right. Thank you very much.

Krab 2010-05-25 22:12:37

If you are interested in a sample of my data in sql insert format, look at http://web.vscht.cz/sedlakf/contacts.sql

Krab 2010-05-25 22:15:13

Answer 2

+1 A:

Well, its hard to provide examples in comments, I decided to post an answer:

If you have to following original data, there is no way to distinguish data from the first structure from those of the second. There should be an additional grouping condition to eleminate directions between the first and the second structure.

sqlite> create table atoms (id INT, atom TEXT, amino TEXT);
sqlite> insert into atoms VALUES (1, 'O', 'HOH');
sqlite> insert into atoms VALUES (2, 'A', 'ARG');
sqlite> insert into atoms VALUES (3, 'B', 'CYS');
sqlite> insert into atoms VALUES (4, 'C', 'SER');
sqlite> insert into atoms VALUES (5, 'D', 'ARG');
sqlite> insert into atoms VALUES (6, 'O1', 'HOH');
sqlite> insert into atoms VALUES (7, 'A1', 'ARG');
sqlite> insert into atoms VALUES (8, 'B1', 'CYS');
sqlite> insert into atoms VALUES (9, 'C1', 'SER');
sqlite> insert into atoms VALUES (10, 'D1', 'ARG');
sqlite> select * from atoms;
1|O|HOH
2|A|ARG
3|B|CYS
4|C|SER
5|D|ARG
6|O1|HOH
7|A1|ARG
8|B1|CYS
9|C1|SER
10|D1|ARG

UPD

Here is the original data:

sqlite> .headers ON
sqlite> .mode columns
sqlite> select * from atoms;
id          atom        amino
----------  ----------  ----------
1           O           HOH
2           A           ARG
3           B           CYS
4           C           SER
5           D           ARG
6           O1          HOH
7           A1          ARG
8           B1          CYS
9           C1          SER
10          D1          ARG
sqlite> select * from contacts;
donor_id    acceptor_id  directness
----------  -----------  ----------
1           4            D
1           5            D
2           1            D
3           1            D
6           9            D
6           10           D
7           6            D
8           6            D

Here is the query:

select
    c1.donor_id, c2.acceptor_id, 'W' as directness
from
    contacts c1, contacts c2, atoms a
where
    c1.acceptor_id = c2.donor_id
    and c1.acceptor_id=a.id
    and a.amino='HOH'
UNION ALL
select
    c1.donor_id, c2.donor_id, 'X' as directness
from
    contacts c1, contacts c2, atoms a
where
    c1.acceptor_id = c2.acceptor_id
    and c1.acceptor_id=a.id
    and a.amino='HOH'
    and c1.donor_id < c2.donor_id
UNION ALL
select
    c1.acceptor_id, c2.acceptor_id, 'X' as directness
from
    contacts c1, contacts c2, atoms a
where
    c1.donor_id = c2.donor_id
    and c1.donor_id=a.id
    and a.amino='HOH'
    and c1.acceptor_id < c2.acceptor_id;

Here is the result:

donor_id    acceptor_id  directness
----------  -----------  ----------
2           4            W
2           5            W
3           4            W
3           5            W
7           9            W
7           10           W
8           9            W
8           10           W
2           3            X
7           8            X
4           5            X
9           10           X

newtover 2010-05-25 21:23:25

You are correct. That's what the table `contacts` is for. `atoms` is just definition of available atoms.

Krab 2010-05-25 21:39:25

@Krab: you are right, I updated my answer with the resulting query.

newtover 2010-05-25 22:20:38

Thank you both. Both your solutions give, as far as I have tested, the same set of contacts (with some duplicates which are easy to filter). I decided to accept newtover's solution, because it gives the information W/X, which I am currently not sure how to add to Francis' solution. Also, as it's more verbose, it helps me not to get lost. However, Francis' solution seems to be faster. Machine time is not a big limit here, so I'll start with the accepted solution and maybe later I will figure how to adjust the former one.

Krab 2010-05-26 06:43:18

This solution is indeed very close to what I had originally, except that it was not a single statement but three python loops, each with two select and one insert statement. As a sql newbie (I've done only simple stuff for web in sql), I didn't realize two things. 1. You can join a table to self (c1.acceptor_id = c2.donor_id).and2. id1 < id2 removes duplicates (like in C-D, D-C).

Krab 2010-05-26 06:50:15

@Krabs: concering the performance, the current contacts table lacks a couple of indexes: (acceptor_id, donor_id) and (donor_id, acceptor_id). And atoms lacks (id, amino). This should help to speed up the query if you need it. Glad to help.

newtover 2010-05-26 07:21:48

@newtover: Thank you for additional info. From my previous tries, I have some indexes, which are suitable (EXPLAIN says they are used) for my queries. I will adjust them to support this query.

Krab 2010-05-26 07:33:45

@Krab: thought it over once again. Index on just (amino), instead of (id, amino) will seemingly be better. This will affect the order of joins, and to start from `atoms` seems to result in smaller intermediate join products.

newtover 2010-05-26 09:14:54

ansaurus

tags:

views:

answers:

Refining data stored in SQLite - how to join several contacts?

Problem background

Database structure

Current solution (insufficient)

Is there a simple way to retrieve all six contacts without duplicates?

related questions