Hi,

I have two files, A (nodes_to_delete) and B (nodes_to_keep). Each file has many lines with numeric ids.

I want the list of numeric ids that are in nodes_to_delete but NOT in nodes_to_keep.

Doing it within a PostgreSQL database is unreasonably slow. Any neat way to do it in bash using Linux CLI tools?

UPDATE: This would seem to be a Pythonic job, but the files are really, really large. I have solved some similar problems using uniq, sort and some set theory techniques. This was about two or three orders of magnitude faster than the database equivalents.
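
For illustration, one common sort/uniq set-difference trick looks roughly like the sketch below. It is not necessarily the exact technique referred to above, the sample ids are made up, and it assumes nodes_to_delete contains no duplicate ids:

```shell
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 5 3 9 1 > "$workdir/nodes_to_delete"
printf '%s\n' 3 7 1 > "$workdir/nodes_to_keep"

# Feed nodes_to_keep twice: every id in it then occurs at least twice,
# so `uniq -u` (print only lines that occur exactly once) drops it.
# What remains are the ids that occur only in nodes_to_delete.
cat "$workdir/nodes_to_keep" "$workdir/nodes_to_keep" "$workdir/nodes_to_delete" \
  | sort \
  | uniq -u > "$workdir/delete_not_keep"

cat "$workdir/delete_not_keep"
```

With the sample data this leaves the ids 5 and 9.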

Adam

+9  A: 

The comm command does that.
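
A minimal sketch of how that might look, assuming one id per line. The sample ids and the intermediate file names (delete.sorted, keep.sorted, only_in_delete) are made up for illustration. comm expects both inputs sorted in the same collation order, hence the explicit LC_ALL=C on every command:

```shell
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 5 3 9 1 > "$workdir/nodes_to_delete"
printf '%s\n' 3 7 1 > "$workdir/nodes_to_keep"

# comm needs sorted input; -u also removes duplicate ids.
LC_ALL=C sort -u "$workdir/nodes_to_delete" > "$workdir/delete.sorted"
LC_ALL=C sort -u "$workdir/nodes_to_keep" > "$workdir/keep.sorted"

# -2 suppresses lines found only in the second file,
# -3 suppresses lines common to both files,
# so only the lines unique to the first file are printed.
LC_ALL=C comm -23 "$workdir/delete.sorted" "$workdir/keep.sorted" > "$workdir/only_in_delete"

cat "$workdir/only_in_delete"
```

With the sample data this leaves the ids 5 and 9. Note that the output is in the lexical order produced by sort, not numeric order.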

msw
And if the files are not sorted yet, `sort` first.
extraneon
+1 Enlightened, great tool that I feel stupid not to have known. Thanks!
Adam Matan
@Adam Matan: much more enlightenment available at `ls /bin /usr/bin | xargs man`
just somebody
@Just I won't start a flame war here, but your comment is just rude.
Adam Matan
@Adam: Ironically, that "comm" bit of arcana dates back to a time when you could keep the whole contents of /bin and /usr/bin in your head, before all these fancy perls and pythons and mysqls. Back in those simpler V7 days you had to make use of all the tools or (gasp!) write your own, with ed(1), in the snow, uphill both ways, and we liked it! ;) I'd probably never know of comm if I'd started later.
msw
@msw True, and it's still amazing how useful these CLI tools can be, even with the shiny new DBMSs.
Adam Matan
@Adam Matan: I'm sorry, rudeness definitely wasn't my intention. In fact, the command I posted is a good way to learn a great deal about the system, and I used to do stuff like that to enlighten myself. Otherwise, e.g., `join(1)` would have remained unknown to me.
just somebody
@Just no offence taken, thanks!
Adam Matan
+1  A: 

The first thing that came to my mind is:

diff nodes_to_delete nodes_to_keep | grep '^<'

I answered before your edit, so I'm not sure this still applies if you found the db way to be slow...
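
A rough sketch of the diff idea, with some assumptions spelled out: both files are sorted first (otherwise diff reports positional differences rather than a set difference), the sample ids are made up, and sed is used here instead of grep so that the leading "< " marker is stripped. diff may be slow on very large inputs and is not guaranteed to produce a minimal edit script, so treat this as an illustration rather than a recommendation:

```shell
# Hypothetical sample data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 5 3 9 1 > "$workdir/nodes_to_delete"
printf '%s\n' 3 7 1 > "$workdir/nodes_to_keep"

sort -u "$workdir/nodes_to_delete" > "$workdir/delete.sorted"
sort -u "$workdir/nodes_to_keep" > "$workdir/keep.sorted"

# In diff's default output, lines present only in the first file start with "< ".
# sed -n with the p flag prints only those lines, with the marker removed.
diff "$workdir/delete.sorted" "$workdir/keep.sorted" \
  | sed -n 's/^< //p' > "$workdir/only_in_delete"

cat "$workdir/only_in_delete"
```

With the sample data this leaves the ids 5 and 9.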

Alberto Zaccagni
+1  A: 

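Maybe you need a better way to do it in Postgres. I can pretty much bet that you won't find a faster way to do it using flat files. You should be able to do a simple join that keeps only the ids with no match in the other table (an anti-join rather than a plain inner join, which would return the ids present in both), and assuming that both id columns are indexed, that should be very fast.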

You're technically correct, and the `explain` supports your claim, but it simply doesn't work for very large tables (tens of millions of rows).
Adam Matan
Yeah, it would be constrained by your memory, unlike something like a sorted comm, but I would think that with two tables holding only an int id field you could get into the tens of millions with no trouble.
That's right in theory, but it simply doesn't work for some reason.
Adam Matan