I have a bash script that processes all of the files in a directory using a loop like

for i in *.txt; do ops.....; done

There are thousands of files and they are always processed in alphanumerical order because of '*.txt' expansion.

Is there a simple way to randomize the order and still ensure that I process each file exactly once?

+2  A: 

You could pipe your filenames through the sort command:

ls | sort --random-sort | xargs ....
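The pipeline above splits names on whitespace once they reach xargs. A minimal sketch of a NUL-delimited variant follows, assuming GNU sort (for `--random-sort` and `-z`), bash, and at least one matching `.txt` file. The function name `process_shuffled` is hypothetical and the `printf` inside the loop is only a placeholder for the real work.

```shell
#!/usr/bin/env bash
# Sketch: NUL-delimited variant of the sort pipeline.
# Assumes GNU sort; process_shuffled is a hypothetical helper name.
process_shuffled() {
    local f
    # printf emits every name followed by a NUL byte;
    # sort -z reads and writes NUL-terminated records.
    while IFS= read -r -d '' f; do
        printf 'processing %s\n' "$f"    # placeholder for the real work
    done < <(printf '%s\0' *.txt | sort -z --random-sort)
}
```

Calling `process_shuffled` in the target directory visits each matching file once, in an order chosen by sort.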
tangens
which sort command are you using? /bin/sort has no such option
ennuikiller
I'm using `sort (GNU coreutils) 6.10`
tangens
GNU coreutils sort has it, though some versions had bugs where its effectiveness varied by locale.
ephemient
@tangens you should specify that in your answer. Not all unix/linux distros come with the gnu toolset.
ennuikiller
it's kinda funny that a "sort" utility has a random option!!
ennuikiller
@ennuikiller, for the answer I just looked through the man page of sort and I wasn't aware that this isn't standard. Sorry.
tangens
@ephemient: if you mean locales with differing collating rules, some or all of that behavior is "by design" (http://en.wikipedia.org/wiki/Collating_sequence). Can you point me to information regarding bugs that are not intended behavior?
Dennis Williamson
@Dennis I can't find their bug tracker but it looks like it was fixed by http://lists.gnu.org/archive/html/bug-coreutils/2006-08/msg00030.html -- what would happen is with `LC_COLLATE` (or `LANG` or `LC_ALL`) set to anything other than empty or `C` or `POSIX`, `sort -R` could be pretty darn deterministic.
ephemient
+1  A: 

Here's an answer that relies on very basic functions within awk so it should be portable between unices.

ls -1 | awk 'BEGIN{srand()} {print rand()*100, $0}' | sort -n | awk '{print $2}'

EDIT:

ephemient makes a good point that the above is not space-safe. Here's a version that is:

ls -1 | awk 'BEGIN{srand()} {print rand()*100, $0}' | sort -n | sed 's/[0-9\.]* //'
dustmachine
Breaks if any filenames contain whitespace.
ephemient
Hopefully nobody ever creates them, but embedded newlines in filenames will still trip this up.
ephemient
+3  A: 

Assuming the filenames do not have spaces, just substitute the output of List::Util::shuffle.

for i in `perl -MList::Util=shuffle -e'$,=$";print shuffle<*.txt>'`; do
    ....
done

If filenames do have spaces but don't have embedded newlines or backslashes, read a line at a time.

perl -MList::Util=shuffle -le'$,=$\;print shuffle<*.txt>' | while read i; do
    ....
done

To be completely safe in Bash, use NUL-terminated strings.

perl -MList::Util=shuffle -0 -le'$,=$\;print shuffle<*.txt>' |
while IFS= read -r -d '' i; do
    ....
done


Not very efficient, but it is possible to do this in pure Bash if desired. sort -R does something like this, internally.

declare -a a                     # create an indexed (integer-keyed) array; it may be sparse
for i in *.txt; do
    j=$RANDOM                    # find an unused slot
    while [[ -n ${a[$j]} ]]; do
        j=$RANDOM
    done
    a[$j]=$i                     # fill that slot
done
for i in "${a[@]}"; do           # iterate in index order (which is random)
    ....
done

Or use a traditional Fisher-Yates shuffle.

a=(*.txt)
for ((i=${#a[*]}; i>1; i--)); do
    j=$[RANDOM%i]
    tmp=${a[$j]}
    a[$j]=${a[$[i-1]]}
    a[$[i-1]]=$tmp
done
for i in "${a[@]}"; do
    ....
done
ephemient
It's not necessary to use dollar signs for array subscript variables or array subscript expressions: `a[j]=${a[i-1]}`. Also `man bash` says "The old format $[expression] is deprecated and will be removed in upcoming versions of bash." (in favor of `$(())` for example in your `j=$[RANDOM%i]`)
Dennis Williamson
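Applying both suggestions from the comment above to the Fisher-Yates loop gives something like the following sketch. The helper name `shuffle_array` is hypothetical, and it shuffles the global array `a` in place. Because `RANDOM` only ranges over 0..32767, this assumes fewer than 32768 files, and the modulo introduces a slight bias that is usually acceptable for this purpose.

```shell
#!/usr/bin/env bash
# Sketch: Fisher-Yates shuffle using $(( )) arithmetic and bare subscripts.
# shuffle_array is a hypothetical helper; it shuffles the global array "a" in place.
shuffle_array() {
    local i j tmp
    for ((i = ${#a[@]}; i > 1; i--)); do
        j=$((RANDOM % i))      # pick an index in 0..i-1 (assumes fewer than 32768 elements)
        tmp=${a[j]}            # swap a[j] with the last unshuffled element
        a[j]=${a[i-1]}
        a[i-1]=$tmp
    done
}

a=(*.txt)
shuffle_array
for f in "${a[@]}"; do
    printf '%s\n' "$f"         # placeholder for the real work
done
```

Since the names never pass through a pipe or word splitting, spaces and newlines in filenames are preserved.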
+1  A: 

Here's a solution with standard unix commands:

for i in $(ls); do echo $RANDOM-$i; done | sort | cut -d- -f 2-
tangens
your solution doesn't work if filenames contain spaces
Dwight Kelly
changing `$(ls)` to `*` will let it work with spaces.
dustmachine
`ls`, `sort`, and `cut` aren't pure Bash commands. Also fails in the horrible case of filenames containing embedded newlines.
ephemient
See my updated answer for a couple pure-Bash shuffles, which incidentally handle odd filenames without problems.
ephemient
@ephemient: Thanks for that solution. I didn't know the Fisher-Yates shuffle before.
tangens
A: 

Here's a Python solution, if it's available on your system:

import glob
import random
files = glob.glob("*.txt")
random.shuffle(files)  # shuffles the list in place and returns None
for file in files:
    print(file)
To satisfy the original question, this wants a `if file.endswith('.txt')`. Or maybe you could turn it into something more generic like `shuf`...
ephemient
+1  A: 

If you have GNU coreutils, you can use shuf:

while IFS= read -r -d '' f
do
    # some stuff with $f
done < <(shuf -ze *)

This will work with files with spaces or newlines in their names.
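To restrict this to the `.txt` files from the question and keep the shuffled names for later use, a sketch along these lines might work. It assumes GNU shuf and bash; `collect_shuffled` and `files` are hypothetical names chosen for illustration.

```shell
#!/usr/bin/env bash
# Sketch: gather shuf output into an array (assumes GNU shuf and bash).
# collect_shuffled and files are hypothetical names used for illustration.
collect_shuffled() {
    local f
    files=()
    # -e treats the arguments as the input items, -z NUL-terminates the output,
    # and -- stops option parsing in case a name starts with a dash.
    while IFS= read -r -d '' f; do
        files+=("$f")
    done < <(shuf -z -e -- *.txt)
}

collect_shuffled
printf 'collected %d file(s)\n' "${#files[@]}"
```

The loop reads from a process substitution rather than a pipe, so the `files` array is still populated after the loop ends.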

Off-topic Edit:

To illustrate SiegeX's point in the comment:

$ a=42; echo "Don't Panic" | while read line; do echo $line; echo $a; a=0; echo $a; done; echo $a
Don't Panic
42
0
42
$ a=42; while read line; do echo $line; echo $a; a=0; echo $a; done < <(echo "Don't Panic"); echo $a
Don't Panic
42
0
0

The pipe causes the while to be executed in a subshell and so changes to variables in the child don't flow back to the parent.

Dennis Williamson
More specifically, coreutils≥6.1, I believe. Personally I'd prefer `shuf -ze * | while read` over `done < <(shuf -ze *)` but effectively they're the same.
ephemient
The reason I like `done < <()` is that it parallels `done < filename` (which avoids using `cat` unnecessarily).
Dennis Williamson
@ephemient The process substitution way `< <(foo)` has the added benefit that it does not create a sub-shell like the pipe method does.
SiegeX