views:

89

answers:

2

I'm storing contacts between different elements. I want to eliminate elements of certain type and store new contacts of elements which were interconnected by the eliminated element.

Problem background

Imagine this problem. You have a water molecule which is in contact with other molecules (if the contact is a hydrogen bond, there can be 4 other molecules around my water). Like in the following picture (A, B, C, D are some other atoms and dots mean the contact).

 A   B
 |   |
 H   H
  . .
   O
  / \
 H   H
 .   .
 C   D

I have the information about all the dots and I need to eliminate the water in the center and create records describing contacts of A-C, A-D, A-B, B-C, B-D, and C-D.

Database structure

Currently, I have the following structure in the database:

Table atoms:

  • "id" integer PRIMARY KEY,
  • "amino" char(3) NOT NULL, (HOH for water or other value)
  • other columns identifying the atom

Table contacts:

  • "acceptor_id" integer NOT NULL, (the atom near to my hydrogen, here C or D)
  • "donor_id" integer NOT NULL, (here A or B)
  • "directness" char(1) NOT NULL, (this should be D for direct and W for water-mediated)
  • other columns about the contact, such as the distance

EDIT: How would look the data in the case depicted earlier.

atoms:

id|atom|amino
1 | O  | HOH
2 | N  | ARG  <- atom A from image
3 | S  | CYS  <- B 
4 | O  | SER  <- C
5 | N  | ARG  <- D

contacts:

donor_id|acceptor_id|directness
1        4           D
1        5           D
2        1           D
3        1           D

From which I need to make

contacts:

donor_id|acceptor_id|directness
3        4           W            <- B-C
3        5           W            <- B-D
2        4           W            <- A-C
2        5           W            <- A-D
2        3           X            <- A-B    (These last two rows are escaping me,
4        5           X            <- C-D     there could be also row D-C, but not
                                             both C-D and D-C. A char 'X' could 
                                             be used to mark "no donor/acceptor")

Current solution (insufficient)

Now, I'm going through all the contacts which have donor.amino = "HOH". In this sample case, this would select contacts from C and D. For each of these selected contacts, I look up contacts having the same acceptor_id as is the donor_id in the currently selected contact. From this information, I create the new contact. At the end, I delete all contacts to or from HOH.

This way, I am obviously unable to create C-D and A-B contacts (the other 4 are OK).

If I try a similar approach - trying to find two contacts having the same donor_id, I end up with duplicate contacts (C-D and D-C).

Is there a simple way to retrieve all six contacts without duplicates?

I'm dreaming about some one page long SQL query which retrieves just these six wanted rows. :-)
However, any other ideas are welcome.

It is preferable to conserve information about who is donor (where possible), but not strictly necessary.

Big thanks to all of you who read this question to this point.

+2  A: 

There's one difficulty with your explanation.

What you start with is a directed graph where each edge represents a connection X=>Y where X is a donor and Y an acceptor. The table atoms is the SQL representation of that graph.

What you seem to want is something that is undirected. So that a link X-Y means that X and Y are linked via a water molecule (or some other species of course) but that X and Y could both be donors or acceptors. For this reason your last table has an ambiguity (that you note) so that some links could occur either way around. It seems to me that means that the column headings donor_id and acceptor_id on your final column don't have any meaning that you have explained. This may be my confusion of course.

If all you want is a table with all 6 links in it, each as one row, but don't worry too much about keeping track of the donor/acceptor thing then this works for me in sqlite3:

 create temporary view hoh_view as 
 select donor_id as id, atoms.id as hoh_id from contacts, atoms 
       where acceptor_id=atoms.id and atoms.amino='HOH' 
 union select acceptor_id as id, atoms.id as hoh_id from contacts, atoms 
       where donor_id=atoms.id and atoms.amino='HOH';

 select a.id, b.id from hoh_view as a, hoh_view as b 
       where a.id > b.id and a.hoh_id=b.hoh_id;

Where I have used a temporary view to make things clearer. You can put this all into one big query if you like by replacing each reference to hoh_view by the first query. It feels a bit nasty to me and there may be a way of tidying it up.

If you do want to keep track of donor/acceptor relationships you need to explain how you decide what to do when both amino acids are acceptors or donors (i.e. the last two rows in your example).

If that doesn't do what you want, then maybe I can fix it up so it does.

Francis Davey
SQL isn't very good at representing graphs because its relational.
Francis Davey
Thank you. The original connections are directed. Many of my new connections can be directed too (A donor, D acceptor). How would you recommend to represent undirected data? Because I could do without the information about the direction of the edge.
Krab
Unfortunately, your code doesn't work as I need. It doesn't care about the fact that the water must be the same - it would create connections between any atoms connected to any water in my db which satisfy a.id > b.id. :-( Would grouping the last select help somehow? There is also an alternative - if it's easier, the task could be solved by removing one of the undirected connections (C-D, D-C). But again, I have no idea how to do it reliably. Thank you for your effort
Krab
@Krab: the original data lacks any grouping information, and Fransis suggested a solution that does not use any grouping. If the information is present, both `WHERE` clauses will include an additional condition. `a.id > b.id` introduce a directed connection based on id value and eliminating duplicates like (C-D, D-C).
newtover
@newtower: The original data contain the grouping information implicitly. Imagine you have side by side twice the same problem as depicted - having original atoms as well as O1, A1, B1, C1, D1. Now, the refined db should contain only 12 connections, 6 for each water, not connections like A1-C. Or maybe I just didn't understand how would you make an additional condition.
Krab
@Krab - you are quite right. What I originally had in mind was a top level select on HOH elements and then using that to direct the later searches. Let me see if I can think of something.
Francis Davey
I've made a correction to permit only the same HOH, which I hope works. If you have some data in insert form that would be great.
Francis Davey
@Donnie: I got the database by parsing a text file having one contact per row. I saved it to a db separating info about contacts and atoms into two tables (as atoms can occur in multiple contacts). How would you recommend to store this data better?
Krab
@Francis Davey: This seems to work like I need. I will test it further tomorrow, but it looks all right. Thank you very much.
Krab
If you are interested in a sample of my data in sql insert format, look at http://web.vscht.cz/sedlakf/contacts.sql
Krab
+1  A: 

Well, its hard to provide examples in comments, I decided to post an answer:

If you have to following original data, there is no way to distinguish data from the first structure from those of the second. There should be an additional grouping condition to eleminate directions between the first and the second structure.

sqlite> create table atoms (id INT, atom TEXT, amino TEXT);
sqlite> insert into atoms VALUES (1, 'O', 'HOH');
sqlite> insert into atoms VALUES (2, 'A', 'ARG');
sqlite> insert into atoms VALUES (3, 'B', 'CYS');
sqlite> insert into atoms VALUES (4, 'C', 'SER');
sqlite> insert into atoms VALUES (5, 'D', 'ARG');
sqlite> insert into atoms VALUES (6, 'O1', 'HOH');
sqlite> insert into atoms VALUES (7, 'A1', 'ARG');
sqlite> insert into atoms VALUES (8, 'B1', 'CYS');
sqlite> insert into atoms VALUES (9, 'C1', 'SER');
sqlite> insert into atoms VALUES (10, 'D1', 'ARG');
sqlite> select * from atoms;
1|O|HOH
2|A|ARG
3|B|CYS
4|C|SER
5|D|ARG
6|O1|HOH
7|A1|ARG
8|B1|CYS
9|C1|SER
10|D1|ARG

UPD

Here is the original data:

sqlite> .headers ON
sqlite> .mode columns
sqlite> select * from atoms;
id          atom        amino
----------  ----------  ----------
1           O           HOH
2           A           ARG
3           B           CYS
4           C           SER
5           D           ARG
6           O1          HOH
7           A1          ARG
8           B1          CYS
9           C1          SER
10          D1          ARG
sqlite> select * from contacts;
donor_id    acceptor_id  directness
----------  -----------  ----------
1           4            D
1           5            D
2           1            D
3           1            D
6           9            D
6           10           D
7           6            D
8           6            D

Here is the query:

select
    c1.donor_id, c2.acceptor_id, 'W' as directness
from
    contacts c1, contacts c2, atoms a
where
    c1.acceptor_id = c2.donor_id
    and c1.acceptor_id=a.id
    and a.amino='HOH'
UNION ALL
select
    c1.donor_id, c2.donor_id, 'X' as directness
from
    contacts c1, contacts c2, atoms a
where
    c1.acceptor_id = c2.acceptor_id
    and c1.acceptor_id=a.id
    and a.amino='HOH'
    and c1.donor_id < c2.donor_id
UNION ALL
select
    c1.acceptor_id, c2.acceptor_id, 'X' as directness
from
    contacts c1, contacts c2, atoms a
where
    c1.donor_id = c2.donor_id
    and c1.donor_id=a.id
    and a.amino='HOH'
    and c1.acceptor_id < c2.acceptor_id;

Here is the result:

donor_id    acceptor_id  directness
----------  -----------  ----------
2           4            W
2           5            W
3           4            W
3           5            W
7           9            W
7           10           W
8           9            W
8           10           W
2           3            X
7           8            X
4           5            X
9           10           X
newtover
You are correct. That's what the table `contacts` is for. `atoms` is just definition of available atoms.
Krab
@Krab: you are right, I updated my answer with the resulting query.
newtover
Thank you both. Both your solutions give, as far as I have tested, the same set of contacts (with some duplicates which are easy to filter). I decided to accept newtover's solution, because it gives the information W/X, which I am currently not sure how to add to Francis' solution. Also, as it's more verbose, it helps me not to get lost. However, Francis' solution seems to be faster. Machine time is not a big limit here, so I'll start with the accepted solution and maybe later I will figure how to adjust the former one.
Krab
This solution is indeed very close to what I had originally, except that it was not a single statement but three python loops, each with two select and one insert statement. As a sql newbie (I've done only simple stuff for web in sql), I didn't realize two things. 1. You can join a table to self (c1.acceptor_id = c2.donor_id).and2. id1 < id2 removes duplicates (like in C-D, D-C).
Krab
@Krabs: concering the performance, the current contacts table lacks a couple of indexes: (acceptor_id, donor_id) and (donor_id, acceptor_id). And atoms lacks (id, amino). This should help to speed up the query if you need it. Glad to help.
newtover
@newtover: Thank you for additional info. From my previous tries, I have some indexes, which are suitable (EXPLAIN says they are used) for my queries. I will adjust them to support this query.
Krab
@Krab: thought it over once again. Index on just (amino), instead of (id, amino) will seemingly be better. This will affect the order of joins, and to start from `atoms` seems to result in smaller intermediate join products.
newtover