views:

35

answers:

1

I'm trying to combine two lists, joining them by a common field suchs as ENST00000371026. I've tried the following but no luck. What is the actual way to do it?

cat> gar1.txt <<EOF
ENST00000371026 ENSG00000152763
ENST00000371023 ENSG00000152763
ENST00000395250 ENSG00000152763
ENST00000309502 ENSG00000163485
ENST00000377464 ENSG00000142599
ENST00000400908 ENSG00000142599
ENST00000337907 ENSG00000142599
ENST00000400907 ENSG00000142599
ENST00000401087 ENSG00000179571
EOF

cat> gar2.txt <<EOF
DDX11L1 ENST00000371026
DDX11L9 ENST00000309502
DDX11L1 ENST00000371026
OR4F5   ENST00000377464
DQ597235        n/a
DQ599768        n/a
LOC388312       ENST00000401087
LOC100132287    ENST00000425496
LOC100132287    ENST00000425496
EOF

join -t"\t" -2 2 -1 1 gar1.txt gar2.txt

An example output would be:

DDX11L1 ENSG00000152763
DDX11L9 ENSG00000163485
OR4F5   ENSG00000142599
LOC388312       ENSG00000179571
LOC100132287    NONE
+1  A: 

Give this a try:

join -a 2 -t $'\t' -2 2 -1 1 -o 2.1 1.2 <(sort gar1.txt) <(sort -k 2,2 gar2.txt) | sed '/\t$/ s/$/NONE/' | uniq

It uses Bash process substitution (<()), but you could simply presort your files if you don't want to use that. Your version of join needs to have the -o option or you could use awk to process the output further. For the "n/a" lines it will print "NONE", but you could do this to eliminate them:

<(sort -k 2,2 gar2.txt | grep -v 'n/a$')

Edit:

This is the output I get without the sed and uniq:

DDX11L9 ENSG00000163485
DDX11L1 ENSG00000152763
DDX11L1 ENSG00000152763
OR4F5   ENSG00000142599
LOC388312       ENSG00000179571
LOC100132287
LOC100132287
DQ597235
DQ599768

And this is what I get with them:

DDX11L9 ENSG00000163485
DDX11L1 ENSG00000152763
OR4F5   ENSG00000142599
LOC388312       ENSG00000179571
LOC100132287    NONE
DQ597235        NONE
DQ599768        NONE

Adding the grep:

DDX11L9 ENSG00000163485
DDX11L1 ENSG00000152763
OR4F5   ENSG00000142599
LOC388312       ENSG00000179571
LOC100132287    NONE

which matches your example output except for the order. If you need to preserve the order it would be possible to do.

Dennis Williamson
+1 helpful, however join -a 2 -t $'\t' -2 2 -1 1 -o 2.1 1.2 <(sort gar1.txt) <(sort -k 2,2 gar2.txt) seems to output a sorted gar2.txt rather than the expect output. Did it work for you for the example files?
D W
@D W: See my edit.
Dennis Williamson
I was doing a silly mistake. The script works great!
D W