Problem Specification:

Given a directory, I want to iterate through the directory and its non-hidden sub-directories,
 and add a whirlpool hash to the names of the non-hidden files.
If the script is re-run, it should replace an old hash with a new one.

<filename>.<extension>   ==>  <filename>.<a-whirlpool-hash>.<extension>

<filename>.<old-hash>.<extension>   ==>  <filename>.<new-hash>.<extension>


Question:

a) How would you do this?

b) Out of all the methods available to you, what makes your method the most suitable?


Verdict:

Thanks all, I have chosen SiegeX's answer for its speed and portability.
It is empirically quicker than the other bash variants,
 and it worked without alteration on my Mac OS X machine.

+1  A: 

You might want to store the results in one file, like in

find . -type f -exec md5sum {} \; > MD5SUMS

If you really want one file per hash:

find . -type f | while IFS= read -r f; do md5sum "$f" > "$f.md5"; done

or even

find . -type f | while IFS= read -r f; do g=$(md5sum "$f" | awk '{print $1}'); echo "$g $f" > "$f-$g.md5"; done
wallenborn
A: 

using zsh:

$ ls
a.txt
b.txt
c.txt

The magic:

$ FILES=(**/*(.))
$ for f in $FILES; do hash=`md5sum $f | cut -f1 -d" "`; mv $f "$f:r.$hash.$f:e"; done
$ ls
a.60b725f10c9c85c70d97880dfe8191b3.txt
b.3b5d5c3712955042212316173ccf37be.txt
c.2cd6ee2c70b0bde53fbe6cac3c8b8bb1.txt

Happy deconstruction!

Edit: added files in subdirectories and quotes around mv argument

Otto Allmendinger
He asked for subdirs, too. Use **find . -type f -print|while read f** in lieu of _for f in *_
NVRAM
Oh, and he may need to quote the file names to handle spaces.
NVRAM
@NVRAM zsh can glob for files in subdirectories with `**/*(.)`
Otto Allmendinger
+2  A: 
find . -type f -print | while IFS= read -r file
do
    # $hashcommand is a placeholder; it must print only the hash
    hash=`$hashcommand "$file"`
    filename=${file%.*}
    extension=${file##*.}
    mv "$file" "$filename.$hash.$extension"
done
Joe Koberg
1. Doesn't handle filenames with spaces, and 2. it will try to rename directories, which will cause it to not find the files within those directories... Use **find . -type f -print|while read file** for the first line, then add quotes to the filenames on the **hash=** and **mv** lines.
NVRAM
Doesn't work for the reasons listed above, plus also doesn't work for files without extensions. I adapted your idea and made a new solution that fixes most of these problems.
Mark Byers
Thanks for the hints!
Joe Koberg
+1  A: 

In sh or bash, two versions. One limits itself to files with extensions...

hash () {
  #openssl md5 t.sh | sed -e 's/.* //'
  whirlpool "$1"   # hash the file passed as the first argument
}

find . -type f -a -name '*.*' | while IFS= read -r f; do
  # remove the echo to run this for real
  echo mv "$f" "${f%.*}.whirlpool-$(hash "$f").${f##*.}"
done

Testing...

...
mv ./bash-4.0/signames.h ./bash-4.0/signames.whirlpool-d71b117a822394a5b273ea6c0e3f4dc045b1098326d39864564f1046ab7bd9296d5533894626288265a1f70638ee3ecce1f6a22739b389ff7cb1fa48c76fa166.h
...

And this more complex version processes all plain files, with or without extensions, with or without spaces and odd characters, etc, etc...

hash () {
  #openssl md5 t.sh | sed -e 's/.* //'
  whirlpool "$1"   # hash the file passed as the first argument
}

# IFS= and -r keep leading/trailing whitespace and backslashes in names intact
find . -type f | while IFS= read -r f; do
  name=${f##*/}
  case "$name" in
    *.*) extension=".${name##*.}" ;;
    *)   extension=   ;;
  esac
  # remove the echo to run this for real
  echo mv "$f" "${f%/*}/${name%.*}.whirlpool-$(hash "$f")$extension"
done
DigitalRoss
+4  A: 
#!/bin/bash
find . -type f -print0 | while IFS= read -r -d '' file
do
    md5sum=`md5sum "${file}" | sed -r 's/ .*//'`
    filename=`echo "${file}" | sed -r 's/\.[^./]*$//'`
    extension="${file:${#filename}}"
    filename=`echo "${filename}" | sed -r 's/\.md5sum-[^.]+//'`
    if [[ "${file}" != "${filename}.md5sum-${md5sum}${extension}" ]]; then
        echo "Handling file: ${file}"
        mv "${file}" "${filename}.md5sum-${md5sum}${extension}"
    fi
done
  • Tested on files containing spaces like 'a b'
  • Tested on files containing multiple extensions like 'a.b.c'
  • Tested with directories containing spaces and/or dots.
  • Tested on files containing no extension inside directories containing dots, such as 'a.b/c'
  • Updated: Now updates hashes if the file changes.

Key points:

  • Use of print0 piped to while read -d $'\0', to correctly handle spaces in file names.
  • md5sum can be replaced with your favourite hash function. The sed removes the first space and everything after it from the output of md5sum.
  • The base filename is extracted using a regular expression that finds the last period that isn't followed by another slash (so that periods in directory names aren't counted as part of the extension).
  • The extension is found by using a substring with starting index as the length of the base filename.
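
For comparison, the last two key points can be sketched in Python. This is only an illustration of the same idea, not part of the bash script above; the function name split_base_ext is made up for the example.

```python
import re

# Last period that is not followed by another period or a slash,
# mirroring the sed expression 's/\.[^./]*$//' used in the script.
_EXT_RE = re.compile(r"\.[^./]*\Z")

def split_base_ext(path):
    """Return (base, extension); extension is '' when there is none."""
    base = _EXT_RE.sub("", path, count=1)
    # Same trick as ${file:${#filename}}: the extension is whatever
    # follows the base, so base + extension always rebuilds the path.
    return base, path[len(base):]
```

Because the character class excludes the slash, a period in a directory name is never mistaken for the start of an extension.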
Mark Byers
For your first version: `filename=${file%.*}` ... `extension=${file##$filename}` ... `echo mv "$file" "$filename.$md5sum$extension"`
Dennis Williamson
I don't think your suggested change would help. It will fail for files without extensions in directories containing periods.
Mark Byers
If not there then where directory names and filenames are similar.
Dennis Williamson
This solution produces wrong filename on my test directory tree (for `'f^Jnewline.ext1'` file). See http://stackoverflow.com/questions/1841737/bash-hashing-multiple-files-recursively/1842682#1842682
J.F. Sebastian
This solution also fails to follow the Spec because it never re-hashes an already hashed file. If the file contents changes, the file needs to have its hash updated.
SiegeX
It followed the spec and more when I posted it. :) I think the spec must have changed.
Mark Byers
+1 Well that sucks, can't be blamed for that, can you.
SiegeX
+1  A: 

Here's my take on it, in bash. Features: skips non-regular files; correctly deals with files with weird characters (e.g. spaces) in their names; deals with extensionless filenames; skips already-hashed files, so it can be run repeatedly (although if files are modified between runs, it adds the new hash rather than replacing the old one). I wrote it using md5 -q as the hash function; you should be able to replace this with anything else, as long as it outputs only the hash, not something like filename => hash.

find -x . -type f -print0 | while IFS="" read -r -d $'\000' file; do
    hash="$(md5 -q "$file")" # replace with your favorite hash function
    [[ "$file" == *."$hash" ]] && continue # skip files that already end in their hash
    dirname="$(dirname "$file")"
    basename="$(basename "$file")"
    base="${basename%.*}"
    [[ "$base" == *."$hash" ]] && continue # skip files that already end in hash + extension
    if [[ "$basename" == "$base" ]]; then
            extension=""
    else
            extension=".${basename##*.}"
    fi
    mv "$file" "$dirname/$base.$hash$extension"
done
Gordon Davisson
+2  A: 

The logic of the requirements is complex enough to justify the use of Python instead of bash. It should provide a more readable, extensible, and maintainable solution.

#!/usr/bin/env python
import hashlib, os

def ishash(h, size):
    """Whether `h` looks like hash's hex digest."""
    if len(h) != size:
        return False # wrong length, so not a digest of this hash function
    try:
        int(h, 16) # whether h is a hex number
        return True
    except ValueError:
        return False

for root, dirs, files in os.walk("."):
    dirs[:] = [d for d in dirs if not d.startswith(".")] # skip hidden dirs
    for path in (os.path.join(root, f) for f in files if not f.startswith(".")):
        with open(path, "rb") as fobj: # read bytes so the digest covers the exact file contents
            suffix = hash_ = "." + hashlib.md5(fobj.read()).hexdigest()
        hashsize = len(hash_) - 1
        # extract old hash from the name; add/replace the hash if needed
        barepath, ext = os.path.splitext(path) # ext may be empty
        if not ishash(ext[1:], hashsize):
            suffix += ext # add original extension
            barepath, oldhash = os.path.splitext(barepath) 
            if not ishash(oldhash[1:], hashsize):
               suffix = oldhash + suffix # preserve 2nd (not a hash) extension
        else: # ext looks like a hash
            oldhash = ext
        if hash_ != oldhash: # replace old hash by new one
           os.rename(path, barepath+suffix)

Here's a test directory tree. It contains:

  • files without extension inside directories with a dot in their name
  • filename which already has a hash in it (test on idempotency)
  • filename with two extensions
  • newlines in names
$ tree a
a
|-- b
|   `-- c.d
|       |-- f
|       |-- f.ext1.ext2
|       `-- g.d41d8cd98f00b204e9800998ecf8427e
|-- c.ext^Mnewline
|   `-- f
`-- f^Jnewline.ext1

7 directories, 5 files

Result

$ tree a
a
|-- b
|   `-- c.d
|       |-- f.0bee89b07a248e27c83fc3d5951213c1
|       |-- f.ext1.614dd0e977becb4c6f7fa99e64549b12.ext2
|       `-- g.d41d8cd98f00b204e9800998ecf8427e
|-- c.ext^Mnewline
|   `-- f.0bee89b07a248e27c83fc3d5951213c1
`-- f^Jnewline.b6fe8bb902ca1b80aaa632b776d77f83.ext1

7 directories, 5 files

The solution works correctly for all cases.


Whirlpool hash is not in Python's stdlib, but there are both pure Python and C extensions that support it e.g., python-mhash.

To install it:

$ sudo apt-get install python-mhash

To use it:

import mhash

print mhash.MHASH(mhash.MHASH_WHIRLPOOL, "text to hash here").hexdigest()

Output: cbdca4520cc5c131fc3a86109dd23fee2d7ff7be56636d398180178378944a4f41480b938608ae98da7eccbf39a4c79b83a8590c4cb1bace5bc638fc92b3e653
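
If neither python-mhash nor a pure Python module is at hand, another thing worth trying is hashlib.new(), which exposes whatever digests the underlying OpenSSL build provides. Whether "whirlpool" is among them depends entirely on the installation, so treat the following as a sketch under that assumption; the function name file_hexdigest and its fallback parameter are made up for the example.

```python
import hashlib

def file_hexdigest(path, algorithm="whirlpool", fallback=None, chunk_size=65536):
    """Hex digest of a file, read in chunks so large files need little memory."""
    try:
        hasher = hashlib.new(algorithm)
    except ValueError:
        # This build of hashlib/OpenSSL does not offer the algorithm.
        if fallback is None:
            raise
        hasher = hashlib.new(fallback)
    with open(path, "rb") as stream:
        # iter() with a sentinel stops as soon as read() returns b"".
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

With fallback left as None the function raises when the algorithm is missing, which seems safer than silently writing a different kind of hash into the file names.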


Invoking whirlpooldeep in Python

from subprocess import PIPE, STDOUT, Popen

def getoutput(cmd):
    return Popen(cmd, stdout=PIPE, stderr=STDOUT).communicate()[0]

hash_ = getoutput(["whirlpooldeep", "-q", path]).rstrip()


git can provide leverage for problems that need to track a set of files based on their hashes.

J.F. Sebastian
Home again, home again, jiggidy-jig! Gooood Evening, J.F!
_ande_turner_
@_ande_turner_: 1. you can compile it from source http://labix.org/python-mhash 2. Use pure Python whirlpool.py http://www.bjrn.se/code/whirlpoolpy.txt `import whirlpool; print Whirlpool("text to hash").hexdigest()`
J.F. Sebastian
3.Invoke `whirlpooldeep` from Python. I've added an example to the answer.
J.F. Sebastian
+1  A: 
Peter Cordes
A: 

Ruby:

#!/usr/bin/env ruby
require 'digest/md5'

Dir.glob('**/*') do |f|
  next unless File.file? f
  next if /\.md5sum-[0-9a-f]{32}/ =~ f
  md5sum = Digest::MD5.file f
  newname = "%s/%s.md5sum-%s%s" %
    [File.dirname(f), File.basename(f,'.*'), md5sum, File.extname(f)]
  File.rename f, newname
end

Handles filenames that have spaces, no extension, and that have already been hashed.

Ignores hidden files and directories — add File::FNM_DOTMATCH as the second argument of glob if that's desired.

jleedev
+1  A: 
Eirik Schwenke
In hashname(), your echo command doesn't quote any of the variables, so they're subject to word-splitting, and then echo joins its args with a single space. So it won't work on filenames with repeated whitespace, or whitespace other than " ".
Peter Cordes
and you're using bash-specific features (maybe just the array in mktest), so you need to say #!/bin/bash (or with env, as you're already doing). Nice job on including a test dataset, though.
Peter Cordes
Eirik Schwenke
hm, the reason this worked seems to be that the non-standard declare-syntax is simply ignored (with warning) by ksh.
Eirik Schwenke
+3  A: 
Peter Cordes
I copied this into `~/whirlpool-rename.pl`, made it executable, moved into my test folder, ran it, and it returned 'invalid top directory at /System/Library/Perl/5.10.0/File/Find.pm line 593.' I don't know how to check if the `libperl-digest-whirlpool` is installed.
_ande_turner_
put in a use Digest::Whirlpool; line. It will fail at that point if you don't have the package. In which case, you'll need to install it, probably using CPAN, or fink, if fink packages it. (The cpan command works on OS X)
Peter Cordes
As for the "invalid top directory at /System/.../File/Find.pm", that happens when you run it without any args. You're passing the empty list to find(). I forgot to make "." the default directory to recurse into. I'll edit that in.
Peter Cordes
Ok, try the new version. I also changed the use Digest; to use Digest::Whirlpool, so it will fail if Digest doesn't have whirlpool. And put in a comment about where you could link instead of rename.
Peter Cordes
Hmm, you could cp -al first, and then run this on the hardlink farm. That loses the advantage of not writing to disk when nothing changes, though, and it's probably cleaner to just handle it in the perl script.
Peter Cordes
You could obviously parse the output of a whirlpool program for each file, if you can't get CPAN working.
Peter Cordes
+1: for error handling
J.F. Sebastian
+1  A: 

In response to your updated question:

If anyone can comment on how I can avoid looking in hidden directories with my BASH Script, it would be much appreciated.

You can avoid hidden directories with find by using

find -name '.?*' -prune -o \( -type f -print0 \)

-name '.*' -prune will prune ".", and stop without doing anything. :/
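
The difference between the two patterns can be checked with Python's fnmatch module. This is only an approximation of the matching that find performs, and the helper name is_hidden_name is made up for the example.

```python
from fnmatch import fnmatchcase

def is_hidden_name(name, pattern=".?*"):
    """True if `name` matches the glob used to prune hidden entries."""
    return fnmatchcase(name, pattern)

# Show how each pattern treats the starting directory "." and other names.
for pattern in (".*", ".?*"):
    for name in (".", ".git", "src"):
        print(pattern, repr(name), is_hidden_name(name, pattern))
```

The '?' demands at least one character after the leading period, so the starting directory "." no longer matches and is not pruned.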

I'd still recommend my Perl version, though. I updated it... You may still need to install Digest::Whirlpool from CPAN, though.

Peter Cordes
See my answer: http://stackoverflow.com/questions/1841737/hashing-multiple-files/1880234#1880234
SiegeX
+4  A: 

Updated to fix:
1. File names with '[' or ']' in their name (really, any character now. See comment)
2. Handling of md5sum when hashing a file with a backslash or newline in its name
3. Functionized hash-checking algo for modularity
4. Refactored hash-checking logic to remove double-negatives
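
On point 2: GNU md5sum starts its output line with a backslash when the file name contains a backslash or a newline, so the first field is not always a clean digest. The script below appears to deal with this by calling read without -r, which drops the backslash. A rough Python equivalent, with a made-up function name, would be:

```python
def digest_from_md5sum_line(line):
    """Extract the hex digest from one line of md5sum output."""
    # The digest is the first space-separated field; a leading backslash
    # only signals that the file name on the line has been escaped.
    first_field = line.split(" ", 1)[0]
    return first_field.lstrip("\\")
```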

#!/bin/bash
if (($# != 1)) || ! [[ -d "$1" ]]; then
    echo "Usage: $0 /path/to/directory"
    exit 1
fi

is_hash() {
 md5=${1##*.} # strip prefix
 [[ "$md5" == *[^[:xdigit:]]* || ${#md5} -lt 32 ]] && echo "$1" || echo "${1%.*}"
}

while IFS= read -r -d $'\0' file; do
    read hash junk < <(md5sum "$file")
    basename="${file##*/}"
    dirname="${file%/*}"
    pre_ext="${basename%.*}"
    ext="${basename:${#pre_ext}}"

    # File already hashed?
    pre_ext=$(is_hash "$pre_ext")
    ext=$(is_hash "$ext")

    mv "$file" "${dirname}/${pre_ext}.${hash}${ext}" 2> /dev/null

done < <(find "$1" -path "*/.*" -prune -o \( -type f -print0 \))

This code has the following benefits over other entries thus far

  • It is fully compliant with Bash versions 2.0.2 and beyond
  • No superfluous calls to other binaries like sed or grep; uses builtin parameter expansion instead
  • Uses process substitution for 'find' instead of a pipe, no sub-shell is made this way
  • Takes the directory to work on as an argument and does a sanity check on it
  • Uses $() rather than `` notation for command substitution, the latter is deprecated
  • Works with files with spaces
  • Works with files with newlines
  • Works with files with multiple extensions
  • Works with files with no extension
  • Does not traverse hidden directories
  • Does NOT skip pre-hashed files, it will recalculate the hash as per the spec
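
For readers who find the parameter expansions dense, here is a rough Python rendering of the renaming logic (is_hash plus the offset trick for the extension). It is a sketch of my reading of the script, not a replacement for it; the names strip_hash and hashed_name are made up for the example.

```python
import string

_HEX = set(string.hexdigits)

def strip_hash(component, min_len=32):
    """Drop a trailing '.<hash>' if the last dot-separated part looks like one."""
    tail = component.rsplit(".", 1)[-1]
    looks_hashed = len(tail) >= min_len and all(c in _HEX for c in tail)
    if not looks_hashed:
        return component
    head, sep, _ = component.rpartition(".")
    # Without a period there is nothing to strip, as with ${1%.*} in bash.
    return head if sep else component

def hashed_name(basename, digest):
    """Insert `digest` before the extension, replacing any old hash."""
    pre_ext = basename.rpartition(".")[0] if "." in basename else basename
    ext = basename[len(pre_ext):]  # like ${basename:${#pre_ext}}
    return strip_hash(pre_ext) + "." + digest + strip_hash(ext)
```

An old hash can sit either before the extension or in place of it, which is why both halves of the name go through strip_hash.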

Test Tree

$ tree -a a
a
|-- .hidden_dir
|   `-- foo
|-- b
|   `-- c.d
|       |-- f
|       |-- g.5236b1ab46088005ed3554940390c8a7.ext
|       |-- h.d41d8cd98f00b204e9800998ecf8427e
|       |-- i.ext1.5236b1ab46088005ed3554940390c8a7.ext2
|       `-- j.ext1.ext2
|-- c.ext^Mnewline
|   |-- f
|   `-- g.with[or].ext
`-- f^Jnewline.ext

4 directories, 9 files 

Result

$ tree -a a
a
|-- .hidden_dir
|   `-- foo
|-- b
|   `-- c.d
|       |-- f.d41d8cd98f00b204e9800998ecf8427e
|       |-- g.d41d8cd98f00b204e9800998ecf8427e.ext
|       |-- h.d41d8cd98f00b204e9800998ecf8427e
|       |-- i.ext1.d41d8cd98f00b204e9800998ecf8427e.ext2
|       `-- j.ext1.d41d8cd98f00b204e9800998ecf8427e.ext2
|-- c.ext^Mnewline
|   |-- f.d41d8cd98f00b204e9800998ecf8427e
|   `-- g.with[or].d41d8cd98f00b204e9800998ecf8427e.ext
`-- f^Jnewline.d3b07384d113edec49eaa6238ad5ff00.ext

4 directories, 9 files
SiegeX
You only need the -d check because you need the path to include a /, right? Otherwise you could find "$@". If there were bare filenames as part of that, though, you'd get dirname="$file".
Peter Cordes
I had 3 files with "[" or "]" in their filenames, and things went awry upon rehashing.
_ande_turner_
@_ande_turner_: Thanks for catching that. Give it another shot, I think you'll be pleasantly surprised. I now use the offset parameter expansion to determine the extension, so the contents of the file name should no longer interfere. I also fixed a problem with 'md5sum' putting a '\' in the hash if the file has a newline or a backslash. This should now work with whatever you throw at it.
SiegeX
@Peter: No, I use the -d because if the user does not pass a directory as the argument, it is probably a mistake on their part. If this behavior is not desired, and you actually do want to run it on just one file, then you can safely remove '! [[ -d "$1" ]]' from the sanity check.
SiegeX
Very nice solution. I learned a lot of Bash tricks from reading this. Adding to bookmarks!
Mark Byers