views:

286

answers:

3

Suppose I have two lists of numbers in files f1 and f2, one number per line. I want to see how many numbers in the first list are not in the second, and vice versa. Currently I am using grep -f f2 -v f1 and then repeating it in the other direction with a shell script. This is pretty slow (quadratic time hurts). Is there a nicer way of doing this?
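(For reference, the current approach amounts to something like the following. This is only a sketch; the -F and -x flags are an assumption here, added so that a number such as 1 is not also counted as a substring match inside 10.)

grep -v -F -x -f f2 f1 | wc -l    # numbers in f1 that are not in f2
grep -v -F -x -f f1 f2 | wc -l    # numbers in f2 that are not in f1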

+2  A: 

Couldn't you just put each number on its own line and then diff(1) them? You might need to sort the lists beforehand for that to work properly, though.

Joey
Will that actually provide counts?
Casebash
Not as such, but you can get that with `grep`/`wc` afterwards. This was just a suggestion on how to improve the quadratic runtime. You will get a more or less readable (depending on the options to `diff`) list of differences, which you can then simply count.
Joey
Okay, will have to play around with this
Casebash
In diff's output, < marks values that are in the first file but not the second, and > marks values in the second file but not the first. A simple grep and wc should provide the desired counts.
Casebash
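Putting the comments together, a minimal sketch of the diff-based counting (this assumes the lists are sorted first, and relies on diff's normal output format, where lines only in the first file are prefixed with < and lines only in the second with >):

sort f1 > f1.sorted
sort f2 > f2.sorted
diff f1.sorted f2.sorted | grep -c '^<'    # count of numbers only in f1
diff f1.sorted f2.sorted | grep -c '^>'    # count of numbers only in f2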
A: 

In the special case where one file is a subset of the other, the following:

cat f1 f2 | sort | uniq -u

would list the lines only in the larger file. And of course piping to wc -l will show the count.
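For example, with two small hypothetical files where f2 is a subset of f1:

$ cat f1
1
2
3
$ cat f2
1
3
$ cat f1 f2 | sort | uniq -u
2
$ cat f1 f2 | sort | uniq -u | wc -l
1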

However, that isn't exactly what you described.

This one-liner serves my particular needs often, but I'd love to see a more general solution.

pavium
+2  A: 

I like `comm` for this sort of thing (the files need to be sorted).

$ cat f1
1
2
3
$ cat f2
1
4
5
$ comm f1 f2
		1
2
3
	4
	5
$ comm -12 f1 f2
1
$ comm -23 f1 f2
2
3
$ comm -13 f1 f2
4
5
$
Stephen Paul Lesniewski
For numerically sorted data it complained that the input wasn't in sorted order; --nocheck-order will suppress the warning.
Casebash
Again, a simple grep and wc can be used to get the actual counts.
Casebash
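Tying the comments together, the counts can be read straight off comm's single-column outputs (a sketch; the process substitution assumes bash, and --nocheck-order only silences the warning rather than fixing unsorted input):

comm -23 f1 f2 | wc -l    # numbers only in f1
comm -13 f1 f2 | wc -l    # numbers only in f2
# if the files are in numeric rather than lexicographic order, re-sort first:
comm -23 <(sort f1) <(sort f2) | wc -l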