I have a text file of about 20 million lines. Each line is 25 characters long. I estimate there are roughly 200k-300k unique lines. What I want to find out is exactly how many unique lines there are, and how many occurrences there are of each (I expect the distribution to be power-law-esque).
I could do this:
sort bigfile | uniq -c | sort -nr > uniqcounts
wc -l uniqcounts
but that seems horribly inefficient in both time and memory: it sorts all 20 million lines just to collapse them into a few hundred thousand distinct ones.
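One alternative I've been wondering about is a single-pass awk count, which would only need to keep the ~300k distinct lines in memory and would leave just the per-line counts for the final sort (a sketch, untested at this scale):

awk '{ counts[$0]++ } END { for (line in counts) print counts[line], line }' bigfile | sort -nr > uniqcounts
wc -l uniqcounts

Sorting ~300k count lines should be far cheaper than sorting the full 20 million, but I don't know how awk's associative arrays behave at this size.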
What is your best command-line solution to this problem?