ansaurus

Question

joining two tab-delimited files by column with same identifiers in ONE step (command)?

Answer 1

A:

You can automate your task with a bash script, and without the use of temporary files, like in this example:

#!/bin/bash

id="$1"
file1="$2"
file2="$3"

# get a filename as a parameter
# read first line of file to get $id position
get_pos() {
  awk -v id="$id" '{
      for (i = 1; i <= NF; i++)
        if ($i == id) {
          print i
          exit
        }
    }' "$1"
}

# get $id positions from headers of the two files
pos1=$(get_pos "$file1")
pos2=$(get_pos "$file2")

# print header
printf "%s\t" "$id"
head -n1 "$file1" | sed -r "s/$id(\t|$)//" | tr -d '\n'
head -n1 "$file2" | sed -r "s/$id(\t|$)//"

# print data, add -a1 option for left join
join -t$'\t' -1 $pos1 -2 $pos2 \
  <(tail -n+2 "$file1" | sort) \
  <(tail -n+2 "$file2" | sort)

It do not do any error check, and maybe can be realized with other tools, like python, in a more elegant way, but I hope it help.

enzotib 2010-10-06 21:37:46

Answer 2

A:

A completely different approach would be to use a lightweight SQL tool, like sqlite.

You can create two tables:

$ sqlite3
SQLite version 3.7.2 
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table fruit (fruit varchar(20), id varchar(20));
sqlite> create table foobar (id varchar(20), foobar varchar(20));

set TAB as a separator and load your files:

sqlite> .separator "\t"
sqlite> .import file1 fruit
sqlite> .import file2 foobar

delete headers:

sqlite> delete from fruit where id = 'ID';
sqlite> delete from foobar where id = 'ID';

then perform all queries you need:

sqlite> select fruit.id, fruit, foobar from fruit, foobar where fruit.id = foobar.id;
alpha   apple   cat
beta    banana  dog
sqlite> .quit
$

It is also possible to automate the task with the help of bash here docs:

#!/bin/bash

sqlite3 <<-EOF
        create table fruit (fruit varchar(20), id varchar(20));
        create table foobar (id varchar(20), foobar varchar(20));
        .separator "\t"
        .import file1 fruit
        .import file2 foobar
        delete from fruit where id = 'ID';
        delete from foobar where id = 'ID';
        select fruit.id, fruit, foobar from fruit, foobar where fruit.id = foobar.id;
        .quit
EOF

enzotib 2010-10-07 20:14:48

ansaurus

tags:

views:

answers:

joining two tab-delimited files by column with same identifiers in ONE step (command)?

related questions