ansaurus

Question

bash, Linux: Set difference between two text files

Answer 1

+9 A:

The comm command does that.

msw 2010-03-24 16:45:38

And if the files are not sorted yet, `sort` first.

extraneon 2010-03-24 16:47:47
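
A minimal sketch of the `comm` plus `sort` approach suggested above. The file names are borrowed from Answer 2, and the sample contents and the scratch directory are made up for illustration. It assumes one ID per line.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)

# Hypothetical sample data: one node ID per line, unsorted.
printf '%s\n' 5 3 9 1 > "$workdir/nodes_to_delete"
printf '%s\n' 3 7 1 > "$workdir/nodes_to_keep"

# comm expects both inputs sorted in the same collation order;
# LC_ALL=C keeps that order consistent between sort and comm.
LC_ALL=C sort -u "$workdir/nodes_to_delete" > "$workdir/delete.sorted"
LC_ALL=C sort -u "$workdir/nodes_to_keep" > "$workdir/keep.sorted"

# -2 hides lines found only in the second file, -3 hides lines common to both,
# so what remains is the set difference: first file minus second file.
difference=$(LC_ALL=C comm -23 "$workdir/delete.sorted" "$workdir/keep.sorted")
printf '%s\n' "$difference"

rm -rf "$workdir"
```

With this sample data the script prints 5 and 9, the IDs that appear only in `nodes_to_delete`.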

+1 Enlightened, great tool that I feel stupid not to have known. Thanks!

Adam Matan 2010-03-24 17:10:50

@Adam Matan: much more enlightenment available at `ls /bin /usr/bin | xargs man`

just somebody 2010-03-24 18:08:59

@Just: I won't start a flame war here, but your comment is just rude.

Adam Matan 2010-03-24 18:43:48

@Adam: Ironically, that "comm" bit of arcana dates back to a time when you could keep the whole contents of /bin and /usr/bin in your head, before all these fancy perls and pythons and mysqls. Back in those simpler V7 days you had to make use of all the tools or (gasp!) write your own, with ed(1), in the snow, uphill both ways, and we liked it! ;) I'd probably never know of comm if I'd started later.

msw 2010-03-24 23:13:16

@msw True, and it's still amazing how useful these CLI tools can be, even with the shiny new DBMSs.

Adam Matan 2010-03-25 08:33:13

@Adam Matan: I'm sorry, rudeness definitely wasn't my intention. In fact, the command I posted is a good way to learn a great deal about the system, and I used to do things like that to enlighten myself. Otherwise, `join(1)`, for example, would have remained unknown to me.

just somebody 2010-03-26 18:00:06

@Just no offence taken, thanks!

Adam Matan 2010-03-27 21:20:45

Answer 2

+1 A:

The first thing that came to my mind is:

diff nodes_to_delete nodes_to_keep | grep '^<'

I answered before your edit, so this may no longer apply if you found the database approach too slow...

Alberto Zaccagni 2010-03-24 16:46:56
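
A rough sketch of the `diff` idea above, under the assumption that both files are already sorted and free of duplicates. The sample contents and the scratch directory are hypothetical. It anchors the `grep` pattern to the start of the line and strips the `< ` prefix that `diff` adds. On very large inputs `diff` is not guaranteed to produce a minimal diff, so this is illustrative rather than a robust set difference.

```shell
# Scratch directory with hypothetical sample data, already sorted and duplicate-free.
scratch=$(mktemp -d)
printf '%s\n' 1 3 5 9 > "$scratch/nodes_to_delete"
printf '%s\n' 1 3 7 > "$scratch/nodes_to_keep"

# Keep only the lines diff reports as present in the first file alone ("< " prefix),
# then drop that two-character prefix to recover the original lines.
only_in_first=$(diff "$scratch/nodes_to_delete" "$scratch/nodes_to_keep" | grep '^<' | cut -c3-)
printf '%s\n' "$only_in_first"

rm -rf "$scratch"
```

For this sample the output is 5 and 9.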

Answer 3

+1 A:

Maybe you need a better way to do it in Postgres; I can pretty much bet that you won't find a faster way to do it using flat files. You should be able to do a simple inner join, and assuming that both id columns are indexed, that should be very fast.

2010-03-24 16:50:15

You're technically correct, and the `explain` supports your claim, but it simply doesn't work for very large (~tens of millions) tables.

Adam Matan 2010-03-24 17:10:10

Yeah, it would be constrained by your memory, unlike something like `comm` on sorted files, but I would think that with two tables holding only an int id field you could get into the tens of millions with no trouble.

2010-03-24 17:14:48

That's right in theory, but it simply doesn't work for some reason.

Adam Matan 2010-03-24 17:23:58
