awk '{print length,$0}' test.txt | sort -nr | cut -d" " -f2-
Not sure how well that will perform although sort can work around memory limits, AFAIK.
A naive approach could be simply:
awk '{ print NF " " $0 }' infile | sort -k1,1nr |
awk '{ $1=""; print $0 }' >outfile
This will keep up to 3 CPUs busy. sort is not limited by the amount of physical memory available; use the -S and -T switches to configure how much memory to use (-S) before resorting to temporary files in a temp directory (-T) on a big enough (and ideally fast) partition.
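With GNU sort that could look like this (the 2G buffer size and the scratch directory are only placeholders):
# same pipeline as above, but sort may use up to 2 GiB of RAM (-S)
# and spills to temporary files under /mnt/fast-scratch (-T)
awk '{ print NF " " $0 }' infile |
sort -S 2G -T /mnt/fast-scratch -k1,1nr |
awk '{ $1=""; print $0 }' >outfile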
If you can produce several input files by subdividing the work leading up to the sort phase, you would then be able to do:
for FILE in infile.* ; do
  awk '{ print NF " " $0 }' $FILE | sort -k1,1nr >$FILE.tmp&
done
wait
sort -k1,1nr -m infile.*.tmp | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.tmp
This will use up to N*2 CPUs; moreover, the last sort (merge-sort) is highly efficient.
Refining further to improve parallelism to N*2+1 by using FIFOs instead of intermediate files, again assuming multiple input files are possible:
for FILE in infile.* ; do
  mkfifo $FILE.fifo
  awk '{ print NF " " $0 }' $FILE | sort -k1,1nr >$FILE.fifo&
done
sort -k1,1nr -m infile.*.fifo | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.fifo
If multiple input files are not possible, you can simulate them (adding I/O overhead which will hopefully be amortized by the number of processes available):
PARALLELISM=5 # I want 5 parallel instances
for N in `seq $PARALLELISM` ; do
  mkfifo infile.$N.fifo
  awk '(NR - 1) % '$PARALLELISM' == '$N' - 1 { print NF " " $0 }' infile |
    sort -k1,1nr >infile.$N.fifo&
done
sort -k1,1nr -m infile.*.fifo | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.fifo
Because we use modulo-line-number we have good locality, and the filesystem cache should ideally bring the cost of reading the input file over and over in $PARALLELISM processes closer to zero.
Even better, reading the input file only once and round-robin-ing input lines into several sort pipes:
PARALLELISM=5 # I want 5 parallel instances
for N in `seq $PARALLELISM` ; do
  mkfifo infile.$N.fifo1
  mkfifo infile.$N.fifo2
  sort -k1,1nr infile.$N.fifo1 >infile.$N.fifo2&
done
awk '{ print NF " " $0 >("infile." ((NR - 1) % '$PARALLELISM' + 1) ".fifo1") }' infile&
sort -k1,1nr -m infile.*.fifo2 | awk '{ $1=""; print $0 }' >outfile
rm -f infile.*.fifo[12]
You should measure performance for various values of $PARALLELISM, then pick the optimal one.
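For instance, something along these lines (purely illustrative; sort_by_words.sh is a hypothetical wrapper for the round-robin version above, taking the parallelism level as its first argument):
for P in 2 4 8 16 ; do
  echo "== PARALLELISM=$P =="
  time ./sort_by_words.sh $P       # hypothetical wrapper around the script above
done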
As shown in other posts, you can of course use cut instead of the final awk (i.e. the one that strips the first column) for potentially better efficiency. :)
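For example, the naive version then becomes (same pipeline, only the last stage changed):
awk '{ print NF " " $0 }' infile | sort -k1,1nr | cut -d' ' -f2- >outfile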
Updated all scripts for the filename convention you provided, and fixed a bug in the last version.
Also, using the new filename convention, if I/O is not the bottleneck then a very slight variation on dave's/niry's solutions should probably be even more efficient:
for FILE in infile.* ; do
  awk '{ print >sprintf("tmpfile.%05d.%s", NF, FILE) }' \
    FILE=`basename $FILE` $FILE&
done
wait
ls -1r tmpfile.* | xargs cat >outfile
rm -f tmpfile.*
Cheers, V.
To create something efficient I would have done something like the following, a two-pass parsing of the file:
In the first pass, read line by line, recording three things: line number, file offset and number of words. This could be parallelized without too much difficulty (for the jobs that start at "random" lines within the file, just add the corresponding start number afterwards).
Now sort the list of the three recorded things by number of words per line. Then iterate over the list, seeking to the corresponding start offset.
From a performance point of view, all the seeking might be slow, but it should be relatively light on memory consumption, only requiring three integers for each line.
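Something along these lines would be a rough single-threaded sketch of the idea (infile/outfile follow the convention above; LC_ALL=C keeps awk's length() in bytes, and the per-line dd calls are only there to illustrate the seek step, since spawning one dd per line would be far too slow in practice):
# pass 1: build an index of "word-count byte-offset byte-length" for every line,
# then sort that small index by word count
LC_ALL=C awk 'BEGIN { off = 0 } { print NF, off, length($0); off += length($0) + 1 }' infile |
sort -k1,1nr |
while read nf offset len ; do
  # pass 2: seek back into the file and re-emit each line in the new order
  dd if=infile bs=1 skip=$offset count=$len 2>/dev/null
  echo    # dd does not copy the newline
done >outfile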
I wonder how quick this would be:
#!/bin/sh
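# reads its input on stdin and writes the lines to stdout,
# grouped by word count (highest first)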
rm -rf /tmp/fb
mkdir /tmp/fb
cd /tmp/fb
awk '{ print $0 > NF }'
ls | sort -nr | xargs cat
Doesn't take advantage of a lot of cores, though.
Since you don't really need to sort, just copy into buckets, you could split into files by number of tokens; this will be the fastest:
perl -ne '@w = split; $t = @w; open($f[$t], sprintf(">%09d", $t)) unless $f[$t]; $f = $f[$t]; print $f $_;'
cat `ls -1r 0*`
BTW, the disk will be the bottleneck; the number of cores and their usage would not really matter.
Not sure I understood the question correctly, but I think a quicksort-like approach might help:
10 split the file into N subfiles, one for each core/cpu
20 sort each partial file using the solutions suggested in some of the answers here
30 once every file is split and sorted, get the first line from each file and put it into a temporary file
40 get the second line from each file, put them in a second temporary file
50 repeat until you have number of temp files == number of cores
60 GOTO 20
Depending on the number of passes, you should approach a perfectly sorted file.
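For illustration, here is a minimal sketch of steps 10 and 20 only; it assumes GNU split (for -n l/N) and an arbitrary chunk count of 4, and the earlier answers show how the sorted chunks could then be merged with sort -m:
N=4                                  # one chunk per core; pick your own value
split -n l/$N infile chunk.          # step 10: N subfiles, lines kept intact (GNU split)
for F in chunk.* ; do
  awk '{ print NF " " $0 }' $F | sort -k1,1nr >$F.sorted &    # step 20
done
wait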
Note that this is not a perfect solution. However, even in a couple of passes it should give you a reasonably well-sorted list of the longest lines in the first temporary file (I am assuming a Gaussian distribution of the length of the lines in the original long file).
ps: if the partial files are still bigger than the available memory, split them again until they fit (depending on the sorting algorithm you're using for each file, though). But in this case you'll need to double the number of passes to get a reasonable approximation.
ps2: I also assume you're not interested in a perfectly sorted file, but more in the statistical significance of the data (i.e. how long the long lines are on average, etc.).