A: 
awk '{print length,$0}' test.txt | sort -nr | cut -d" " -f2-

Not sure how well that will perform, although sort can work around memory limits, AFAIK.

Nick Presta
+7  A: 

A naive approach could be simply:

awk '{ print NF " " $0 }' infile | sort -k1,1nr |
  awk '{ $1=""; print $0 }' >outfile

This will keep up to 3 CPUs busy. sort is not limited by the amount of physical memory available: use the -S switch to configure how much memory it may use before it resorts to temporary files, and the -T switch to place those temporary files in a temp directory on a big enough (and ideally fast) partition.
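
For example (assuming GNU sort; the 1G buffer size and the /mnt/scratch temp directory below are placeholders to adapt to your machine, not values from the question):

awk '{ print NF " " $0 }' infile |
  sort -S 1G -T /mnt/scratch -k1,1nr |
  awk '{ $1=""; print $0 }' >outfile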

If you can produce several input files by subdividing the work leading up to the sort phase, you would then be able to do:

for FILE in infile.* ; do
  awk '{ print NF " " $0 }' $FILE | sort -k1,1nr >$FILE.tmp&
done
wait
sort -k1,1nr -m infile.*.tmp | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.tmp

This will use up to N*2 CPUs; moreover, the last sort (merge-sort) is highly efficient.

Refining further, you can improve parallelism to N*2+1 by using FIFOs instead of intermediate files, again assuming multiple input files are possible:

for FILE in infile.* ; do
  mkfifo $FILE.fifo
  awk '{ print NF " " $0 }' $FILE | sort -k1,1nr >$FILE.fifo&
done
sort -k1,1nr -m infile.*.fifo | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.fifo

If multiple input files are not possible, you can simulate them (adding I/O overhead which will hopefully be amortized by the number of processes available):

PARALLELISM=5 # I want 5 parallel instances
for N in `seq $PARALLELISM` ; do
  mkfifo infile.$N.fifo
  awk '(NR - 1) % '$PARALLELISM' == '$N' - 1 { print NF " " $0 }' infile |
    sort -k1,1nr >infile.$N.fifo&
done
sort -k1,1nr -m infile.*.fifo | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.fifo

Because we select lines by line number modulo $PARALLELISM, each process reads the input file sequentially (good locality), and the filesystem cache should ideally bring the cost of reading the input file over and over in $PARALLELISM processes close to zero.

Even better, reading the input file only once and round-robin-ing input lines into several sort pipes:

PARALLELISM=5 # I want 5 parallel instances
for N in `seq $PARALLELISM` ; do
  mkfifo infile.$N.fifo1
  mkfifo infile.$N.fifo2
  sort -k1,1nr infile.$N.fifo1 >infile.$N.fifo2&
done
awk '{ print NF " " $0 >("infile." ((NR - 1) % '$PARALLELISM' + 1) ".fifo1") }' infile&
sort -k1,1nr -m infile.*.fifo2 | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.fifo[12]

You should measure performance for various values of $PARALLELISM then pick the optimal one.
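
One rough way to make that measurement, assuming the chosen pipeline is saved as a script (the name parsort.sh is made up here) that reads PARALLELISM from its environment instead of hard-coding it:

for P in 1 2 4 8 16 ; do
  echo "PARALLELISM=$P"
  time env PARALLELISM=$P sh parsort.sh
done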

EDIT

As shown in other posts, you can of course use cut instead of the final awk (the one that strips the first column) for potentially better efficiency. :)
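
For example, applied to the naive pipeline:

awk '{ print NF " " $0 }' infile | sort -k1,1nr | cut -d" " -f2- >outfile

A side benefit: cut leaves the rest of each line untouched, whereas reassigning $1 in awk rebuilds the record, which collapses runs of whitespace into single spaces and leaves a leading separator.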

EDIT2

Updated all scripts for the filename convention you provided, and fixed a bug in the last version.

Also, using the new filename convention, if I/O is not the bottleneck then a very slight variation on dave/niry's solutions should probably be even more efficient:

for FILE in infile.* ; do
  awk '{ print >sprintf("tmpfile.%05d.%s", NF, FILE) }' \
    FILE=`basename $FILE` $FILE&
done
wait
ls -1r tmpfile.* | xargs cat >outfile
rm -f tmpfile.*

Cheers, V.

vladr
Great suggestions. I have not yet benchmarked all of these solutions side-by-side, but I can confirm that the second solution works pretty well. (It turns out that the lines I am trying to sort are in separate files, so this solution fits well.) I had less luck with the third solution; it seemed to use only two sort processes, and then only one. Also, I would like to report that the bottleneck is *not* disk I/O.
conradlee
Interesting... the 3rd and 2nd solutions should have been quasi-equivalent (except the third should have been more efficient.) If I/O is not the bottleneck then a variation on `dave`/`niry`'s solutions should probably be even more efficient: `for FILE in INFILE.* ; do awk '{ print >sprintf("OUTFILE.%05d.%s", NF, FILE) }' FILE="`basename \"$FILE\"" "$FILE" done ; wait ; ls -1r OUTFILE.* | xargs cat >OUTFILE`
vladr
Hmm, that sounds good, but because I'm a shell script novice (and the formatting in the comments is bad) I don't quite get where a couple of things fit together. Given that my files are labeled infile.001, infile.002, ... ,infile.011, ... ,file.n could you write up this script in your solution? If you do so, I will benchmark your various solutions against each other (with various numbers of processors).
conradlee
You need to correct the last part of your last solution. Right now it says to `ls -1r infile.*.tmp | xargs cat >outfile`, but you need to perform this on OUTFILE* instead of infile.*.tmp. The same goes for the last line. This solution is indeed very quick (I'll post some benchmarks in a minute).
conradlee
Corrected the last solution.
vladr
A: 

To create something efficient I would have done something like the following, a two-pass parse of the file:

In the first pass, read line by line, recording three things: line number, file offset and number of words. This could be parallelized without too much difficulty (for the jobs that start at "random" lines within the file, just add the corresponding starting line number afterwards).

Now sort the list of the three recorded values by number of words per line. Then iterate over the list, seeking to the corresponding start offset and copying out each line.

From a performance point of view, all the seeking might be slow, but it should be relatively light on memory consumption, requiring only three integers per line.
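
A rough single-process sketch of the idea in shell (my assumptions: single-byte characters so that awk's length() counts bytes, exactly one newline per line, and a per-line dd in the second pass, which is slow but shows the index-and-seek structure):

# pass 1: write "words offset length" for every line of infile
LC_ALL=C awk '{ print NF, off, length($0); off += length($0) + 1 }' infile >index.tmp

# pass 2: sort the small index by word count, then seek back into infile
sort -k1,1nr index.tmp | while read -r nf off len ; do
  dd if=infile bs=1 skip=$off count=$len 2>/dev/null
  echo
done >outfile
rm -f index.tmp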

hlovdal
+4  A: 

I wonder how quick this would be:

#!/bin/sh
# reads the input on stdin and writes the result to stdout,
# using one bucket file per word count under /tmp/fb
rm -rf /tmp/fb
mkdir /tmp/fb
cd /tmp/fb || exit 1
awk '{ print $0 > NF }'
ls | sort -nr | xargs cat

Doesn't take advantage of a lot of cores, though.
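
Combining this with the zero-padding idea from the comments below gives a variant that drops the extra sort; the %05d width is an arbitrary assumption that no line has 100000 or more words:

#!/bin/sh
# reads the input on stdin and writes the result to stdout,
# using one zero-padded bucket file per word count under /tmp/fb
rm -rf /tmp/fb
mkdir /tmp/fb
cd /tmp/fb || exit 1
awk '{ print $0 > (sprintf("%05d", NF)) }'
ls -1r | xargs cat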

dave
+1 probably fast enough :)
vladr
BTW, use `ls -1` -- or `ls -1r` to get rid of the sort too.
vladr
+1 you need '>>' in the awk, '>' will truncate the file on every line
niry
@niry no, it will truncate only the first time the file is opened, `awk` will append on subsequent occurrences -- try it. :)
vladr
yes, you are right.
niry
@Vlad ls -r won't work. It sorts alphabetically, so 2 would come before 10. Unless someone knows how to 0 pad NF in awk...
dave
true... `sprintf("%05d", NF)` :)
vladr
+1  A: 

Since you don't really need to sort, just copy into buckets, you could split into files by number of tokens; this will be the fastest:

perl -ne '@w = split " "; $t = @w; open($f[$t], ">", sprintf("%09d", $t)) unless $f[$t]; print {$f[$t]} $_;'

cat `ls -1r 0*`

BTW, the disk will be the bottleneck; the number of cores and their usage would not really matter.

niry
Not really different from dave's answer; you essentially read each line twice (once from the original file, once from the intermediate bin) and write it twice (once to the intermediate bin and once to the output file) without using the filesystem cache (i.e. memory) efficiently. A `sort` solution without intermediate files will theoretically make better use of memory even when the input file is much larger than the amount of available physical memory -- but both solutions would have to be timed. :)
vladr
Completely agree. If you time it, let me know. If the file is much larger than memory, sort might be much slower; it will need to read/write the data several times, not just twice!
niry
Internally `sort` also uses bins when falling back to disk IIRC, so it should not be more than 2x. :)
vladr
A: 

Not sure I understood the question correctly, but I think a divide-and-conquer approach might help:

10 split the file into N subfiles, one for each core/cpu
20 sort each partial file using the solutions suggested in some of the answers here
30 once every file is split and sorted, get the first line from each file and put it into a temporary file
40 get the second line from each file, put them in a second temporary file
50 repeat until you have number of temp files == number of cores
60 GOTO 20

Depending on the number of passes, you should approach a perfectly sorted file.
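
A minimal sketch of steps 10-20, assuming GNU coreutils (split -n l/N and nproc); the chunk.* names are arbitrary:

N=`nproc`                        # step 10: one chunk per core,
split -n l/$N -d infile chunk.   # without splitting any line
for F in chunk.* ; do            # step 20: sort each chunk by word count
  awk '{ print NF " " $0 }' $F | sort -k1,1nr >$F.sorted &
done
wait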

Note that this is not a perfect solution. However, even in a couple of passes it should give you a reasonably well-sorted list of the longest lines in the first temporary file (I am assuming a Gaussian distribution of the length of the lines in the original long file).

PS: if the partial files are still bigger than the available memory, split them again until they fit (depending on the sorting algorithm you're using for each file, though). But in this case you'll need to double the number of passes to get a reasonable approximation.

PS2: I also assume you're not interested in a perfectly sorted file, but more in the statistical significance of the data (i.e. how long the long lines are on average, etc.).

lorenzog