views: 426
answers: 5

I have a file like this:

1 2 3 
4 5 6
7 6 8
9 6 3
4 4 4

What are some one-liners that can output unique elements of the nth column to another file?

EDIT: Here's a list of solutions people gave. Thanks guys!

cat in.txt | cut -d' ' -f 3 | sort -u
cut -c 1 t.txt | sort -u
awk '{ print $2 }' cols.txt | uniq
perl -anE 'say $F[0] unless $h{$F[0]}++' filename
+3  A: 

Taking the unique values of the third column:

$ cat in.txt | cut -d' ' -f 3 | sort -u
3
4
6
8

cut -d' ' tells cut to split the input on spaces, and -f 3 selects the third field. Finally, sort -u sorts the output, keeping only unique entries.
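To send the result to another file, as the question asks, and to make the column number a parameter, the same pipeline can be wrapped like this (a sketch; N, in.txt and out.txt are placeholder names):

$ N=3
$ cut -d' ' -f "$N" in.txt | sort -u > out.txt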

Mark Rushakoff
Why use cat there? Any command beginning "cat filename | nextcommand" can be written more efficiently as "nextcommand < filename".
j_random_hacker
Merely force of habit. It reads more easily to me from left to right. You're absolutely right that input redirection beats cat... and if we're going to get nit-picky, we could just as well do `cut -d' ' -f 3 in.txt` anyway.
Mark Rushakoff
You can actually even write it as "< filename nextcommand" if you want to put the filename at the left. Looks weird, but it works! :)
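For example, this gives the same result as the pipeline above (just an illustration):

$ < in.txt cut -d' ' -f 3 | sort -u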
j_random_hacker
You were the first to teach me about 'cut' and 'sort'. Thanks a lot!
Lin
+6  A: 

Corrected: Thank you Mark Rushakoff.

$ cut -c 1 t.txt | sort | uniq

or

$ cut -c 1 t.txt | sort -u


1
4
7
9
Sinan Ünür
+1 uniq is The Way (10k vote is mine)
dfa
uniq doesn't strip duplicates that aren't adjacent. `cut -c 5 t.txt | uniq` for the third column fails because the 3s aren't next to each other. That's why you do `sort | uniq` or more recently, just `sort -u`.
Mark Rushakoff
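To make that concrete with the sample input above: the third column is 3 6 8 3 4, so uniq on its own leaves both 3s (a quick check):

$ cut -d' ' -f 3 in.txt | uniq
3
6
8
3
4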
@dfa Thank you! The suspense is over. ;-)
Sinan Ünür
Now that he has hit 10k he might slow down. He is endangering my position in the standings.
Chas. Owens
-1 sorry, for the reason Mark gave.
j_random_hacker
@j_random_hacker and @Mark yup, sorry about that. @Chas. ;-)
Sinan Ünür
All is well again :) +2!
j_random_hacker
The downside to `sort -u` or `sort | uniq` is that no more processing can be done until everything has been read (and sorted). This is the reason I like Perl for this job.
Chas. Owens
@Chas: Yes, but the upside is that it's much more memory-efficient. If the job is small, both this and your solution work quickly, but for large jobs, your Perl solution will thrash.
j_random_hacker
+3  A: 

Say your file is "cols.txt" and you want the unique elements of the second column:

awk '{ print $2 }' cols.txt | uniq
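Note that uniq only removes adjacent duplicates, so this relies on equal values happening to sit next to each other in the file; a variant that works regardless of ordering (a sketch on the same file) is:

awk '{ print $2 }' cols.txt | sort -u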

You might find the following article useful for learning more about such utilities:

ars
Thanks for that link. Very useful!
Lin
+8  A: 

In Perl before 5.10

perl -lane 'print $F[0] unless $h{$F[0]}++' filename

In Perl after 5.10

perl -anE 'say $F[0] unless $h{$F[0]}++' filename

Replace 0 with the column you want to output.

For j_random_hacker, here is an implementation that will use very little memory (but will be slower and requires more typing):

perl -lane 'BEGIN {dbmopen %h, "/tmp/$$", 0600; unlink "/tmp/$$.db" } print $F[0] unless $h{$F[0]}++' filename

dbmopen creates an interface between a DBM file (which it creates or opens) and the hash named %h. Anything stored in %h will be stored on disc instead of in memory. Deleting the file with unlink ensures that it will not stick around after the program is done, but has no effect on the current process: under POSIX rules, an unlinked file remains accessible through any filehandles that are still open to it.
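If you would rather pass the column number on the command line than edit the one-liner, one option (a sketch; col is 1-based and read from the environment here) is:

col=3 perl -lane 'print $F[$ENV{col}-1] unless $h{$F[$ENV{col}-1]}++' filename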

Chas. Owens
+1, a nice quick solution that works very well for small/medium input sizes. But it uses O(n) memory for n items, so for large input sizes, use Sinan's sorting approach instead.
j_random_hacker
What makes you think it uses O(n) memory? It uses O(m) memory, where m is the number of unique items in the file. So, if the file had 10,000 items but only 3 unique values, it would only store 3 items in the hash.
Chas. Owens
@j_random_hacker Chas. is right. Using the shell utilities, the whole input is sorted first, and then unique elements are selected. Using the Perl method, input order is preserved (if that matters), and if sorting needs to be done, it needs to be done on a list no longer than the original input (and likely shorter if there are duplicates). So, `cat? + cut/sort/uniq` will probably thrash sooner than the Perl solution will.
Sinan Ünür
@Chas: You're right. I misspoke; I meant to say "n distinct items" instead of "n items." But the point remains that this algorithm uses much more memory than Sinan's sort-based approach when most items are unique (which is not some pathological case but rather fairly typical).
j_random_hacker
@Sinan: No. "sort" uses an efficient external-memory sort algorithm -- the limit is *disk space*, not memory, so in effect it can sort arbitrarily large inputs without crashing. (By comparison I don't know of any efficient external-memory hashtable data structures -- but if you do, educate me.) Also, the "--stable" option can be used to sort stably.
j_random_hacker
@j_random_hacker This algorithm is superior to sorting in every way. It takes `O(n)` time, whereas the sort will take on average `O(n log n)` time and has a worst case of `O(n^2)` time. This algorithm needs `O(m)` space (where m is the number of unique items); the sort requires `O(n)` space. My implementation assumes that the number of unique items will fit in memory, so I am using a hash for performance reasons. If the number of unique items is large, you can always use a tied DBM to move the hash to disc, which gives you the same memory footprint as the sort.
Chas. Owens
@j_random_hacker You have never heard of DBM files? We are talking old technology here ('79ish): http://en.wikipedia.org/wiki/Dbm Perl has an easy way to make a DBM look like a normal hash. Take a look at dbmopen (the old DBM-specific interface) and tied variables (the new generalized interface): http://perldoc.perl.org/functions/dbmopen.html http://perldoc.perl.org/AnyDBM_File.html
Chas. Owens
@Chas: 1. I'm well aware of DBM and other B-tree-like data structures. You will find that using DBM *dramatically* slows performance of your hash-based approach, since disk accesses will take place in random order -- whereas external memory sorting algos always access the disk in sequential order. (That is what I meant by "efficient external-memory hashtable" -- efficient for the purposes of this task.)
j_random_hacker
2. Your statement of O(n^2) worst-case time is true only for some sorting algorithms, such as quicksort. "sort" may well use the more efficient mergesort, which has *guaranteed O(n log n) worst-case time* (it will definitely use this algorithm for the external sorting passes). Heapsort is another algorithm with *guaranteed O(n log n) worst-case* time.
j_random_hacker
3. Your statement that your approach takes O(n) time is only true if certain pathological input sequences (that would cause many hash collisions) do not occur. If they do occur, the hash effectively becomes a linked list, with each hash lookup taking O(n), for O(n^2) performance overall. In practice these inputs *are* extremely unlikely to occur -- just as sorting algorithms with worst-case O(n^2) time *are* extremely unlikely to be presented with inputs that produce those worst-case times.
j_random_hacker
For a data set huge enough to cause memory problems the difference between `O(n)` and the average `O(n log n)` run times will easily make up for the cost of using a DBM. The other thing to remember is that you need a data set that is huge and highly unique to cause problems for this algorithm. A terabyte data set that has a gigabyte worth of unique data will cause more problems for sort than for this algorithm.
Chas. Owens
I've never disputed that hashtables will be faster provided they fit in RAM. But sorting will be faster for sufficiently large n_distinct, because then almost every hash/DBM access causes a disk seek + disk read since it's not currently in RAM. When sorting, only sequential reads are used, so if 500 records fit in a disk block, on average a disk read is performed only on every 500th record access, and no seeks are needed. (SSDs eliminate the seek time but not the read time.)
j_random_hacker
As you probably know, it pays to beware the hidden constants in Big Oh notation. E.g. for searching text, suffix arrays are always faster in practice than suffix trees, despite being O(n log n) while the latter are O(n). The difference there, and with sorting-vs-hashing here, is in the nonlinear memory access patterns, which confound existing caching mechanisms.
j_random_hacker
If the sort and the DBM were working with the same sized data, then yes, the efficiencies of the sort's implementation might give it an edge even though the hash's order of growth is better, but the hash/DBM gets to work with a subset of the data. So, in order for your assertion to be correct, we need a file that won't fit in memory (4 gig of RAM is common now, so we are talking about something larger than that) and that is highly unique.
Chas. Owens
I tested this empirically with a 373KB file. This favors sort, because it gets to read the entire file in one read while the DBM must still write 2 ** 16 times. The runtimes at 100% unique were 1.9s for the DBM and 1.5s for the sort (which remains constant across all tests). So, the sort is roughly 26% faster. At 25% unique the time for the DBM drops to 1.7s (the sort roughly 13% faster), and at 6% unique they are roughly the same. It is important to note that the in-memory version runs the worst-case scenario in 0.1s.
Chas. Owens
I then took /usr/share/dict/words, concatenated it with itself, shuffled it, and used the result as a data file. The specs for the file are now: 469,873 items, 50% unique, 4.7MB. Sort runs in 17.4s, DBM in 14.7s, and the in-memory hash in 0.7s. As you can see, as the file grows in size, the inefficiency of the sort's order of growth starts to outweigh the efficiency of its implementation. This is the whole point of big O notation. For a file large enough not to fit in memory (currently around 2 gig for most machines) I would expect the DBM solution to perform even better.
Chas. Owens
Re your 1st response to my last comment: We're saying (almost) the same thing in different ways, I think. Re responses #2 and #3: A 373KB or 4.7MB file easily fits in memory (the former possibly even in L2 cache!), so I expect a hash to be MUCH faster here even at 100% uniqueness. Only when n_distinct gets so big that a hashtable containing all distinct items needs *several times the size of available RAM* does the hashing solution start to thrash horribly -- that's when sorting will be much faster. As you say, you would need a large (multigigabyte) file to see this behaviour.
j_random_hacker
Earlier I said that it is only when n_distinct is sufficiently large that sorting will dominate. I should have qualified that further by saying that it is only when n_distinct is sufficiently large *and the ratio of n to n_distinct is sufficiently small* that sorting wins. (Even then it's trivial to modify mergesort to eliminate duplicates during each pass, which will dramatically boost sorting performance when there are only a few distinct items.) All I'm trying to do here is refute the claim you made in your 2nd comment that "This algorithm is superior to sorting in every way".
j_random_hacker
With every exchange, the set of circumstances in which using the UNIX sort command is better than using the Perl one-liner shrinks. You have gone so far this time as to propose a change in its behavior. I may have overreached slightly by claiming "This algorithm is superior to sorting in every way", so I now amend it to "This algorithm is superior to sorting in every way that matters practically"; however, the fact that there is a sweet spot where sort becomes, slightly, better than hashing __with these implementations__ has no bearing on the algorithms themselves.
Chas. Owens
Interesting that you criticise me for altering my original claim, then proceed to do the same yourself. Anyway, for a ~9GB file containing 1,000,000,000 pseudorandom decimal integers, 1 per line, each in the range 0-499,999,999 (i.e. approx 50% unique), on my 4GB RAM Linux PC: Perl in-RAM hashing ran out of memory and died after 11 minutes; sort -u completed after 3hr 42min; and the Perl dbmopen()+hash version was still running after 8hr 40min when the sysadmin rebooted the machine.
j_random_hacker
If you don't think that the case where both n and n_distinct/n are large (that is, where sorting dominates) "matters practically", then I suppose you don't think that analyzing web logs for unique visitors matters practically. Or eliminating duplicate DNA sequences in databases. Those are two applications that come to mind, the latter being one I have extensive experience with. Just to be absolutely clear, I have always said that the case where n or n_distinct/n is very small (which is where hashing dominates) is also very important -- but it's absurd to say that the other case is not.
j_random_hacker
+2  A: 

If you're using awk, there's no need for other commands:

awk '!_[$2]++{print $2}' file
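Here `_` is just an array used as a record of values already seen: `!_[$2]++` is true only the first time a given value of column 2 appears, so each unique value prints once, in input order. To send the output to another file (unique.txt is a placeholder name), redirect as usual:

awk '!_[$2]++ { print $2 }' file > unique.txt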
ghostdog74
if you want to pass the column number as a parameter instead of hardcoding it: awk -v col=2 '!_[$col]++ {print $col}' file
glenn jackman