views:

47

answers:

2

Very often I want to join two ascii-files, which are both tables in the sense that they consist of columns separated by tab, like this:

file 1

FRUIT   ID
apple   alpha
banana  beta
cherry  gamma

file 2

ID  FOOBAR
alpha   cat
beta    dog
delta   airplane

and I want to join them like this with an inner join:

FRUIT   ID  FOOBAR
apple   alpha   cat
banana  beta    dog

or with a left join:

FRUIT   ID  FOOBAR
apple   alpha   cat
banana  beta    dog
cherry  gamma   n/a

(The identifiers used for joining are not necessarily unique.)

What I am doing so far is:

  1. Make copies of the input files without header.
  2. Sort the input files by column.
  3. Use the linux join command on the sorted versions.
  4. Delete intermediate files.

This is error prone as I need to count the columns to specify them later to "sort" and "join" by number (even more error prone with lots of columns and very broad columns), I must not forget to specify that tab is the delimiter and need to remove/insert/fix the header each time etc.

Can anyone recommend a much simpler way? Prefereably where I don't need to sort and where I can specify the column by name, not number? Something like "joincommand ID file1 file2 > result"?

A: 

You can automate your task with a bash script, and without the use of temporary files, like in this example:

#!/bin/bash

id="$1"
file1="$2"
file2="$3"

# get a filename as a parameter
# read first line of file to get $id position
get_pos() {
  awk -v id="$id" '{
      for (i = 1; i <= NF; i++)
        if ($i == id) {
          print i
          exit
        }
    }' "$1"
}

# get $id positions from headers of the two files
pos1=$(get_pos "$file1")
pos2=$(get_pos "$file2")

# print header
printf "%s\t" "$id"
head -n1 "$file1" | sed -r "s/$id(\t|$)//" | tr -d '\n'
head -n1 "$file2" | sed -r "s/$id(\t|$)//"

# print data, add -a1 option for left join
join -t$'\t' -1 $pos1 -2 $pos2 \
  <(tail -n+2 "$file1" | sort) \
  <(tail -n+2 "$file2" | sort)

It do not do any error check, and maybe can be realized with other tools, like python, in a more elegant way, but I hope it help.

enzotib
A: 

A completely different approach would be to use a lightweight SQL tool, like sqlite.

You can create two tables:

$ sqlite3
SQLite version 3.7.2 
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> create table fruit (fruit varchar(20), id varchar(20));
sqlite> create table foobar (id varchar(20), foobar varchar(20));

set TAB as a separator and load your files:

sqlite> .separator "\t"
sqlite> .import file1 fruit
sqlite> .import file2 foobar

delete headers:

sqlite> delete from fruit where id = 'ID';
sqlite> delete from foobar where id = 'ID';

then perform all queries you need:

sqlite> select fruit.id, fruit, foobar from fruit, foobar where fruit.id = foobar.id;
alpha   apple   cat
beta    banana  dog
sqlite> .quit
$ 

It is also possible to automate the task with the help of bash here docs:

#!/bin/bash

sqlite3 <<-EOF
        create table fruit (fruit varchar(20), id varchar(20));
        create table foobar (id varchar(20), foobar varchar(20));
        .separator "\t"
        .import file1 fruit
        .import file2 foobar
        delete from fruit where id = 'ID';
        delete from foobar where id = 'ID';
        select fruit.id, fruit, foobar from fruit, foobar where fruit.id = foobar.id;
        .quit
EOF
enzotib