I have a text file of about 20 million lines. Each line is 25 characters long. I estimate that there are probably about 200k-300k unique lines. What I want to find out is exactly how many unique lines there are, and how many occurrences of each line there are (I expect the result to be power-law-esque).

I could do this:

sort bigfile | uniq -c | sort -nr > uniqcounts
wc -l uniqcounts

but that is horribly inefficient, both memory- and time-wise.

What is your best command line solution to this problem?

A: 

I'm not sure there's a better solution than the one you posted: O(n log(n) + n). The final "sort -nr" that you mention isn't strictly necessary given the problem statement, but it makes the output easier for humans to grok.

I'd be very interested if someone could come up with a solution that's faster than this (in complexity). Of course, writing a special-purpose program to do this same thing would likely be faster than using sort and uniq.

slacy
It's certainly possible to solve this in O(n) time with a simple linear scan plus a decent hash function, but that requires O(n) memory as well. I don't think you can do it in less than O(n) memory while still beating O(n log n) time, but that's mainly intuition.
Tyler McHenry
+6  A: 

I tend to lean towards Perl when I have text-processing problems like this, especially since Perl is installed on most Unix systems. (You could probably do the same thing with awk, which may be even more widely available.)

Something like this should do the trick:

#!/usr/bin/perl
use strict;
use warnings;

# Count how many times each distinct line appears.
my %lines;
while (<>) {
    chomp;
    $lines{$_}++;
}

print "Total unique lines: ", scalar(keys %lines), "\n";

# Print each line with its count, most frequent first.
foreach my $line (sort { $lines{$b} <=> $lines{$a} } keys %lines) {
    printf "%6d  %s\n", $lines{$line}, $line;
}

(You could do this as a one-liner, but broken out makes it easier to read.)
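
For reference, a rough one-liner version of the same idea (an untested sketch, using the file name from the question):

perl -lne '$count{$_}++; END { print scalar keys %count; printf "%6d  %s\n", $count{$_}, $_ for sort { $count{$b} <=> $count{$a} } keys %count }' bigfile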

This requires O(n) memory for the hash keys, where n is the number of unique lines. Building the counts is expected O(N) in the total number of input lines, since hash lookups are O(1) on average (degrading only under pathological collisions). The final, optional sort runs over the unique lines only, so it costs O(n log n) and may dominate the runtime when the number of unique lines is large.

Commodore Jaeger
I'd swap $lines{$a} <=> $lines{$b} to $lines{$b} <=> $lines{$a} to get the most frequent lines first.
Osama ALASSIRY
Essentially this algorithm keeps one hash entry per unique line in memory. If that working set exceeds your RAM, you'll be in real trouble once you start swapping. It also has an O(m log m) cost for the final sort, where m is the number of unique lines. The poster estimated roughly 200k-300k unique lines out of 20 million, i.e. only about 1 line in 100 is unique.
slacy
@Osama: Thanks. I've updated the code.
Commodore Jaeger
@slacy: The OP specified 20 million lines at 25 characters each. That's about 500 megabytes, which should be reasonable on any modern system. If there are only 200k unique lines, we'll need on the order of 5 megabytes for the hash. But you're right, if we start swapping my algorithm is toast.
Commodore Jaeger
+1  A: 

Ensure you do this before testing your sort and uniq solution:

export LC_ALL=C

Setting LC_ALL=C forces plain byte-wise comparison instead of locale-aware collation, which usually makes sort much faster. It would be good if you could at least compare this and the Perl solution time-wise.
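
For example, a rough comparison could look like this (count_lines.pl is just a hypothetical name for the Perl script from the answer above; note that bash's time keyword times the whole pipeline):

export LC_ALL=C
time sort bigfile | uniq -c | sort -nr > uniqcounts
time perl count_lines.pl bigfile > perlcounts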

pixelbeat
+2  A: 

I'm running the risk of being considered off-topic and downvoted, but I must rant about this.

20 million * 25 chars = 500,000,000 bytes (assuming you don't mean Unicode)

That's about 500 MB of RAM, which is not a huge amount for a modern computer.

Please don't complain about this being horribly inefficient memory- and time-wise. The decision to store redundant data in a flat text file was what's inefficient and wrong.

Use a database (sqlite for example) instead of a flat file.

Use a table like

CREATE TABLE lines (line VARCHAR(25), occurrences INTEGER)

to store the unique lines and their occurrence counts.
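
A rough sketch of how that could look with the sqlite3 command-line shell (an untested illustration; it assumes the input is named bigfile and that the lines contain no '|' characters, which is sqlite3's default .import separator):

sqlite3 counts.db <<'EOF'
CREATE TABLE raw (line VARCHAR(25));
.import bigfile raw
CREATE TABLE lines AS
  SELECT line, COUNT(*) AS occurrences
    FROM raw
   GROUP BY line;
SELECT COUNT(*) FROM lines;  -- number of unique lines
SELECT * FROM lines ORDER BY occurrences DESC LIMIT 20;
EOF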

If it's not your app that generates this text file, complain to the developers about it!

Anonymous
+1  A: 

With awk (use nawk or /usr/xpg4/bin/awk on Solaris):

awk '{ _[$0]++ }
END {
  for (k in _)
    print k, _[k]
}' infile
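
To also get the total number of unique lines the question asks for, you could save the per-line counts and count the output lines in one pass, e.g. (with "counts" as an arbitrary output file name):

awk '{ _[$0]++ } END { for (k in _) print k, _[k] }' infile | tee counts | wc -l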
radoulov