How can I join all of the files in a directory? I can do it in one step by explicitly naming the files, as shown below. Is there a way to do it without explicitly naming the files?

join <( \
  join <( \
    join <( \
      join \
        <(sort "${rpkmDir}/HS0477.chsn.rpkm") \
        <(sort "${rpkmDir}/HS0428.chsn.rpkm") ) \
      <(sort "${rpkmDir}/HS0419.chsn.rpkm") ) \
    <(sort "${rpkmDir}/HS0299.chsn.rpkm") ) \
  <(sort "${rpkmDir}/HS0445.chsn.rpkm")
A: 

You can do it by cat ./* >outfile

xt.and.r
No - that does not work. Join finds matching lines in the files based on a key (since no key is specified, on the first column of each file), assuming that the files are all sorted in the same order.
Jonathan Leffler
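To make the difference concrete, here is a minimal sketch with two tiny made-up files (the file names and values are hypothetical, not taken from the question). It assumes bash, since it relies on process substitution.

```bash
# Two small sample files (hypothetical data) in a temporary directory.
tmp=$(mktemp -d)
printf 'g1 5\ng2 7\n' > "$tmp/a.rpkm"
printf 'g2 9\ng3 4\n' > "$tmp/b.rpkm"

# cat only concatenates: all four lines come out, nothing is matched up.
cat_out=$(cat "$tmp/a.rpkm" "$tmp/b.rpkm")

# join pairs lines whose first field is equal: only g2 occurs in both files.
join_out=$(join <(sort "$tmp/a.rpkm") <(sort "$tmp/b.rpkm"))
echo "$join_out"
```

With these inputs, join prints the single line "g2 7 9", while cat produces four unrelated lines.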
+2  A: 
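#!/bin/bash

# Fold join over every matching file, accumulating the result in $data.
data=
first=1
for f in "${rpkmDir}"/HS*.chsn.rpkm
do
  if [ "$first" ]
  then
    # Seed the accumulator with the first file, sorted on the join key.
    data="$(sort "$f")"
    first=
    continue
  fi
  # join reads the accumulated result from stdin via the here-string.
  data="$(join <(sort "$f") /dev/stdin <<< "$data")"
done
echo "$data"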
Ignacio Vazquez-Abrams
Do you need to 'echo "$data"' into a pipe into bash? Or explain that you have generated the script and need to execute what you produced as a shell script?
Jonathan Leffler
It is indeed a script. I had hoped that the shebang line at the top would have made this apparent.
Ignacio Vazquez-Abrams
This is a script that writes a script - I think. You then have to feed the output of the script shown into the shell. Normally, you just execute a script to ... get the commands executed. Here you have to execute your script and then run bash on the output.
Jonathan Leffler
Incorrect. It uses command and process substitution to build up the results in `$data`, feeding it back into `join` each iteration.
Ignacio Vazquez-Abrams
I should have mentioned I didn't need to eliminate unmatched lines, making @ghostdog74's answer most concise for what I need. Still, your answer will be very useful when I need that functionality +1. I wish I could accept two answers.
D W
*shrug* I merely extended your example to an arbitrary set of files.
Ignacio Vazquez-Abrams
+1  A: 

Since join (in classic UNIX and under POSIX) is defined to work on exactly two files at a time, you are going to have to do the iteration yourself, somehow.

While your notation is marvellously minimal, it is also inscrutable. You can probably use pipes, together with the fact that '-' as a file name denotes standard input, to make the sequencing linear. The hard part is connecting everything together without creating any explicit temporary files. You may be best off simply writing a script that writes the commands and feeds them into bash.

Maybe (untested script):

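cd "${rpkmDir}" || exit 1
ls HS*.chsn.rpkm |
{
read -r file
script="sort $file"
while read -r file
do
    script="$script | join - <(sort $file)"
done
# Emit the generated pipeline so that bash receives it on standard input.
echo "$script"
} | bash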
Jonathan Leffler
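The same '-' convention can also be used without generating a script at all. The following is only a sketch, assuming bash: the function name join_all is hypothetical, and it joins its arguments left to right on the first field.

```bash
#!/bin/bash

# join_all FILE... : join every file on its first field (hypothetical helper).
join_all() {
  if [ "$#" -eq 1 ]
  then
    # Base case: a single file only needs sorting.
    sort "$1"
  else
    # Join everything except the last file, then join that result,
    # read from standard input via '-', with the sorted last file.
    local last="${!#}"
    join_all "${@:1:$#-1}" | join - <(sort "$last")
  fi
}
```

Assuming the files from the question, it could be called as join_all "${rpkmDir}"/HS*.chsn.rpkm. Every file is sorted on the fly, so no explicit temporary files are needed.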
I wasn't aware of the '-' trick. Thanks +1
D W
+1  A: 

Use awk. Say you want to join on the first field:

awk '{a[$1]=a[$1] FS $0}END{for(i in a) print i,a[i]}' file*
ghostdog74
That doesn't eliminate lines where file1 contains the key and file2 does not - whereas the join command (without options) does eliminate unmatched lines.
Jonathan Leffler
Correct me if I am wrong, but I don't see the OP stating that requirement. I already stated my assumption in my post, based on an example using the first field. Until the OP elaborates on the data format, all solutions will be based on wild guesses and assumptions. BTW, it's also not that difficult to include code to do what you are assuming.
ghostdog74
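As one possible illustration of that last point, the sketch below keeps only the keys that occur in every input file, which is what join does by default. It is not the code from the answer above: the function name awk_inner_join is hypothetical, and it assumes whitespace-separated fields, keys that are unique within each file, and no empty input files.

```bash
#!/bin/bash

# awk_inner_join FILE... : print only keys present in all files (hypothetical helper).
awk_inner_join() {
  awk '
    FNR == 1 { nfiles++ }          # count input files (an empty file is never counted)
    {
      key = $1
      $1 = ""                      # drop the key; $0 is rebuilt with a leading separator
      row[key] = row[key] $0       # append the remaining fields for this key
      seen[key]++                  # assumes each key occurs at most once per file
    }
    END {
      for (k in row)
        if (seen[k] == nfiles)     # keep only keys found in every file
          print k row[k]
    }
  ' "$@" | sort
}
```

The trailing sort is there because awk does not guarantee any particular order for a for-in loop over an array.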
using paste in some way would seem to be better for this application
D W
you will have to sort as well when using paste
ghostdog74
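A minimal sketch of the paste approach, assuming bash, space-separated fields, and that every file contains exactly the same set of keys (the sample data is made up). paste lines files up row by row without looking at the keys, which is why each file has to be sorted first.

```bash
# Two small sample files (hypothetical data) with the same keys in different order.
tmp=$(mktemp -d)
printf 'g2 7\ng1 5\n' > "$tmp/a.rpkm"
printf 'g1 3\ng2 9\n' > "$tmp/b.rpkm"

# Sort both files so the rows line up, and cut the key column from the second
# file so that the key is not repeated in the output.
paste_out=$(paste -d' ' <(sort "$tmp/a.rpkm") <(sort "$tmp/b.rpkm" | cut -d' ' -f2-))
echo "$paste_out"
```

If a key is missing from one of the files, the rows no longer line up and paste silently pairs values that belong to different keys.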
That's true. Anyway, I guess I have to choose Ignacio Vazquez-Abrams' answer because it answers the original question, even though your solution is useful to me.
D W