How can I use shell one-liners and common GNU tools to combine the lines of two files as a Cartesian product? What is the most succinct, beautiful and "linuxy" way?

For example, if I have two files:

$ cat file1
a
b
$ cat file2
c
d
e

The result should be

a, c
a, d
a, e
b, c
b, d
b, e
+2  A: 

Edit: Oops... Sorry, I thought this was tagged python...

If you have Python 2.6 or later:

from itertools import product
print('\n'.join((', '.join(elt) for elt in (product(*((line.strip() for line in fh) for fh in (open('file1','r'), open('file2','r'))))))))

a, c
a, d
a, e
b, c
b, d
b, e

If you have a Python version older than 2.6:

def product(*args, **kwds):
    '''
    Source: http://docs.python.org/library/itertools.html#itertools.product
    '''
    # product('ABCD', 'xy') --> Ax Ay Bx By Cx Cy Dx Dy
    # product(range(2), repeat=3) --> 000 001 010 011 100 101 110 111
    pools = map(tuple, args) * kwds.get('repeat', 1)
    result = [[]]
    for pool in pools:
        result = [x+[y] for x in result for y in pool]
    for prod in result:
        yield tuple(prod)
print('\n'.join((', '.join(elt) for elt in (product(*((line.strip() for line in fh) for fh in (open('file1','r'), open('file2','r'))))))))
unutbu
That would work, but Python is not what I asked for.
Pavel Shved
+1  A: 

Solution 1:

perl -e '{use File::Slurp; @f1 = read_file("file1"); @f2 = read_file("file2"); map { chomp; $v1 = $_; map { print "$v1,$_"; } @f2 } @f1;}'

DVK
Why did you use `map` here? Those should be `for` loops.
Kinopiko
@Kinopiko: Weren't you just complaining about "language police" on a different thread?
Telemachus
The only thing I like to use more than maps is Regular Expressions. :)
DVK
@Telemachus: If you can't beat them, join them.
Kinopiko
Language Police is right here: Language Cops are coming and busting you! :-)
Pavel Shved
Do you have a _badge_ for that? :)
DVK
+6  A: 

Here's a shell script to do it:

while IFS= read -r a; do while IFS= read -r b; do echo "$a, $b"; done < file2; done < file1

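Though that will be quite slow. I can't think of any precompiled tool that does this directly. The next step for speed would be to do the above in awk/perl. Note that awk's `for (i in a)` visits keys in an unspecified order, so the output may not come out in the order shown in the question: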

awk 'NR==FNR { a[$0]; next } { for (i in a) print i",", $0 }' file1 file2

Hmm, how about this hacky solution to use precompiled logic?

paste -d, <(sed -n "$(yes 'p;' | head -n $(wc -l < file2))" file1) \
          <(cat $(yes 'file2' | head -n $(wc -l < file1)))
pixelbeat
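To make the `paste` trick above easier to follow, here is a hedged breakdown of it as a small function. It is only a sketch: the name `paste_product` and the temporary files `left` and `right` are illustrative and not part of the original answer, and it assumes both input files end with a newline so that `wc -l` counts every line.

```shell
# Hypothetical wrapper that spells out the two streams fed to paste.
paste_product() {
    n1=$(($(wc -l < "$1")))    # number of lines in the first file
    n2=$(($(wc -l < "$2")))    # number of lines in the second file
    dir=$(mktemp -d) || return 1

    # Left column: the sed script is "p;" repeated n2 times, so each line
    # of the first file is printed n2 times in a row.
    sed -n "$(yes 'p;' | head -n "$n2")" "$1" > "$dir/left"

    # Right column: the whole second file, repeated n1 times.
    i=0
    while [ "$i" -lt "$n1" ]; do
        cat "$2"
        i=$((i + 1))
    done > "$dir/right"

    # Both columns now have n1 * n2 lines; paste zips them with a comma.
    paste -d, "$dir/left" "$dir/right"
    rm -rf "$dir"
}
```

Calling `paste_product file1 file2` on the files from the question should pair every line of `file1` with every line of `file2`, using a bare comma as the separator, just as the one-liner does.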
@Pixelbeat: your first version needs to reverse the order of `file1` and `file2`. (That is, it should be `done < file2; done < file1` to get the desired result.)
Telemachus
@Telemachus, the order is irrelevant: if I say "Cartesian product", I really *mean it*.
Pavel Shved
+5  A: 

The mechanical way to do it in shell, not using Perl or Python, is:

while IFS= read -r line1
do
    while IFS= read -r line2
    do echo "$line1, $line2"
    done < file2
done < file1

The join command can sometimes be used for these operations; however, I'm not sure whether it can do a Cartesian product as a degenerate case.

One step up from the double loop would be:

while read line1
do
    sed "s/^/$line1, /" file2
done < file1
Jonathan Leffler
I'd go for the first solution because it doesn't make the files look like they're substantially different.
Pavel Shved
It (the first solution) would likely be substantially slower, but it would also be immune to odd characters (such as slashes) in the data. Fixing the sed version so that such characters are not a problem is a bit fiddlier, and at that point you start thinking about using Perl or Python after all.
Jonathan Leffler
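One possible way around the odd-character problem mentioned above, without leaving the shell toolbox, is to swap `sed` for `awk` and hand the prefix over through the environment, so it is never parsed as a pattern or replacement string. This is a minimal sketch rather than the answer's own method; the function name `prefix_product` and the variable name `prefix` are made up for illustration.

```shell
# Hypothetical helper: prefix every line of the second file with each line
# of the first, without interpreting any character in the data.
prefix_product() {
    while IFS= read -r line1; do
        # The prefix travels through the environment, so awk never treats
        # slashes, ampersands or backslashes in it as special.
        prefix=$line1 awk '{ print ENVIRON["prefix"] ", " $0 }' "$2"
    done < "$1"
}
```

It still starts one process per line of the first file, like the `sed` loop, so it trades no speed for the added safety.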
@Pavel - thanks for the editorial assist.
Jonathan Leffler
+2  A: 
Dennis Williamson
Nice, but I sure would not want to maintain this script. :)
ghostdog74
Truly delightful, but unmaintainable. :)
Pavel Shved
+1  A: 
awk 'FNR==NR{ a[++d]=$1; next}
{
  for ( i=1;i<=d;i++){
    print $1","a[i]
  }
}' file2 file1

# ./shell.sh
a,c
a,d
a,e
b,c
b,d
b,e
ghostdog74
+1  A: 

OK, this is a derivation of Dennis Williamson's solution above, since he noted that his does not read from files:

$ echo {`cat a | tr "\012" ","`}\,\ {`cat b | tr "\012" ","`}$'\n'
a, c
 a, d
 a, e
 b, c
 b, d
 b, e
DVK
This is what that gives me: `{a,b,}, {c,d,e,}` as a literal string.
Dennis Williamson
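The literal string is what bash's expansion order predicts: brace expansion runs before command substitution, so the braces have already been looked at (and left alone) by the time the file contents are substituted in. A hedged sketch of one workaround is to build the brace expression as a string first and only then let a fresh bash parse it. The function name `brace_product` is hypothetical, and the approach is only reasonable for trusted input, because the file contents end up being parsed as shell code.

```shell
# Hypothetical helper; needs bash to be installed for brace expansion.
# Assumes trusted input: each file has at least two lines, and no line
# contains whitespace, commas, braces, quotes or other shell metacharacters.
brace_product() {
    left=$(paste -sd, "$1")     # e.g. a,b
    right=$(paste -sd, "$2")    # e.g. c,d,e
    # The file contents are already part of the string before bash parses
    # it, so bash sees {a,b}', '{c,d,e} and brace-expands it normally.
    bash -c "printf '%s\n' {$left}', '{$right}"
}
```

With the files from the question, `brace_product file1 file2` should print one `x, y` pair per line. A file with a single line would yield a literal `{a}`, since bash only expands braces that contain a comma or a range.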
+1  A: 

A solution using join, awk and process substitution:

join <(xargs -I_ echo 1 _ < setA) <(xargs -I_ echo 1 _ < setB) |
  awk '{ printf("%s, %s\n", $2, $3) }'
Yassin
What are the contents of the file "a"? Should one of them be a different file? The AWK could probably be replaced by `cut -f2- -d' '`.
Dennis Williamson
The "a" file contains the set. They may be different if wanted. I'll correct it!
Yassin
@Dennis, `cut` is probably better, since it works even if `setB` contains lines with whitespace.
Pavel Shved