ansaurus

Question

Answer 1

+1 A:

You might want to store the results in one file, like in

find . -type f -exec md5sum {} \; > MD5SUMS

If you really want one file per hash:

find . -type f ! -name '*.md5' | while IFS= read -r f; do md5sum "$f" > "$f.md5"; done

or even

find . -type f ! -name '*.md5' | while IFS= read -r f; do g=$(md5sum "$f" | awk '{print $1}'); echo "$g $f" > "$f-$g.md5"; done
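If the hashes go into a single file, that file can later be used to check the tree again. A minimal sketch, assuming GNU coreutils md5sum; the helper names write_sums and check_sums are made up for illustration:

```shell
#!/bin/bash
# Hypothetical helpers; assume GNU md5sum is installed.

# Write one "hash  path" line per regular file under the current directory.
# MD5SUMS itself is excluded, because the redirection creates it before
# find starts walking the tree.
write_sums() {
    find . -type f ! -name MD5SUMS -exec md5sum {} + > MD5SUMS
}

# Re-hash every listed file. Prints only the files that no longer match
# and returns non-zero if there is at least one mismatch.
check_sums() {
    md5sum --quiet -c MD5SUMS
}
```

Run write_sums once in the top directory, then check_sums whenever you want to know whether anything changed.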

wallenborn 2009-12-03 18:08:59

Answer 2

A:

using zsh:

$ ls
a.txt
b.txt
c.txt

The magic:

$ FILES=(**/*(.))
$ for f in $FILES; do hash=$(md5sum "$f" | cut -f1 -d" "); mv "$f" "${f:r}.$hash.${f:e}"; done
$ ls
a.60b725f10c9c85c70d97880dfe8191b3.txt
b.3b5d5c3712955042212316173ccf37be.txt
c.2cd6ee2c70b0bde53fbe6cac3c8b8bb1.txt

Happy deconstruction!

Edit: added files in subdirectories and quotes around mv argument

Otto Allmendinger 2009-12-03 18:09:01

He asked for subdirs, too. Use **find . -type f -print|while read f** in lieu of _for f in *_

NVRAM 2009-12-03 18:22:30

Oh, and he may need to quote the file names to handle spaces.

NVRAM 2009-12-03 18:23:52

@NVRAM zsh can glob for files in subdirectories with `**/*(.)`

Otto Allmendinger 2009-12-03 18:53:04

Answer 3

+2 A:

find . -type f -print | while IFS= read -r file
do
    hash=$($hashcommand "$file")    # $hashcommand must print only the hash
    filename=${file%.*}
    extension=${file##*.}
    mv "$file" "$filename.$hash.$extension"
done

Joe Koberg 2009-12-03 18:11:50

1. Doesn't handle filenames with spaces, and 2. it will try to rename directories, which will cause it to not find the files within those directories... Use **find . -type f -print|while read file** for the first line, then add quotes to the filenames on the **hash=** and **mv** lines.

NVRAM 2009-12-03 18:21:20

Doesn't work for the reasons listed above, plus also doesn't work for files without extensions. I adapted your idea and made a new solution that fixes most of these problems.

Mark Byers 2009-12-03 19:04:08

Thanks for the hints!

Joe Koberg 2009-12-03 19:28:17

Answer 4

+1 A:

In sh or bash, two versions. One limits itself to files with extensions...

hash () {
  #openssl md5 "$1" | sed -e 's/.* //'
  whirlpool "$1"
}

find . -type f -a -name '*.*' | while IFS= read -r f; do
  # remove the echo to run this for real
  echo mv "$f" "${f%.*}.whirlpool-$(hash "$f").${f##*.}"
done

Testing...

...
mv ./bash-4.0/signames.h ./bash-4.0/signames.whirlpool-d71b117a822394a5b273ea6c0e3f4dc045b1098326d39864564f1046ab7bd9296d5533894626288265a1f70638ee3ecce1f6a22739b389ff7cb1fa48c76fa166.h
...

And this more complex version processes all plain files, with or without extensions, with or without spaces and odd characters, etc, etc...

hash () {
  #openssl md5 "$1" | sed -e 's/.* //'
  whirlpool "$1"
}

find . -type f | while IFS= read -r f; do
  name=${f##*/}
  case "$name" in
    *.*) extension=".${name##*.}" ;;
    *)   extension=   ;;
  esac
  # remove the echo to run this for real
  echo mv "$f" "${f%/*}/${name%.*}.whirlpool-$(hash "$f")$extension"
done

DigitalRoss 2009-12-03 18:20:34

Answer 5

+4 A:

#!/bin/bash
find . -type f -print0 | while IFS= read -r -d $'\0' file
do
    md5sum=`md5sum "${file}" | sed -r 's/ .*//'`
    filename=`echo "${file}" | sed -r 's/\.[^./]*$//'`
    extension="${file:${#filename}}"
    filename=`echo "${filename}" | sed -r 's/\.md5sum-[^.]+//'`
    if [[ "${file}" != "${filename}.md5sum-${md5sum}${extension}" ]]; then
        echo "Handling file: ${file}"
        mv "${file}" "${filename}.md5sum-${md5sum}${extension}"
    fi
done

Tested on files containing spaces like 'a b'
Tested on files containing multiple extensions like 'a.b.c'
Tested with directories containing spaces and/or dots.
Tested on files containing no extension inside directories containing dots, such as 'a.b/c'
Updated: Now updates hashes if the file changes.

Key points:

Use of print0 piped to while read -d $'\0', to correctly handle spaces in file names.
md5sum can be replaced with your favourite hash function. The sed removes the first space and everything after it from the output of md5sum.
The base filename is extracted using a regular expression that finds the last period that isn't followed by a slash (so that periods in directory names aren't counted as part of the extension).
The extension is found by using a substring with starting index as the length of the base filename.
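
The same split can also be done with parameter expansion alone, without calling sed. This is only a sketch, not the code from the answer: the function name split_ext and the globals stem and ext are made up for illustration, and unlike the regular expression above it treats a leading dot (as in '.bashrc') as part of the name rather than as an extension.

```shell
#!/bin/bash
# Hypothetical helper: sets the globals `stem` and `ext` for the path in $1.
split_ext() {
    local name=${1##*/}             # last path component
    local dir=${1%"$name"}          # directory part, including the trailing slash
    ext=
    case $name in
        ?*.*) ext=.${name##*.} ;;   # only a dot after the first character counts
    esac
    stem=$dir${name%"$ext"}         # strip the extension from the file name only
}

split_ext './a.b/c';      printf '[%s] [%s]\n' "$stem" "$ext"
split_ext './dir/a.b.c';  printf '[%s] [%s]\n' "$stem" "$ext"
```

Because the extension is taken from the last path component only, a dot in a directory name such as 'a.b/c' never ends up in ext.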

Mark Byers 2009-12-03 18:40:20

For your first version: `filename=${file%.*}` ... `extension=${file##$filename}` ... `echo mv "$file" "$filename.$md5sum$extension"`

Dennis Williamson 2009-12-03 19:07:23

I don't think your suggested change would help. It will fail for files without extensions in directories containing periods.

Mark Byers 2009-12-03 19:12:40

If not there then where directory names and filenames are similar.

Dennis Williamson 2009-12-03 19:55:26

This solution produces wrong filename on my test directory tree (for `'f^Jnewline.ext1'` file). See http://stackoverflow.com/questions/1841737/bash-hashing-multiple-files-recursively/1842682#1842682

J.F. Sebastian 2009-12-04 13:40:53

This solution also fails to follow the Spec because it never re-hashes an already hashed file. If the file contents changes, the file needs to have its hash updated.

SiegeX 2009-12-11 21:42:05

It followed the spec and more when I posted it. :) I think the spec must have changed.

Mark Byers 2009-12-11 21:47:11

+1 Well that sucks, can't be blamed for that, can you.

SiegeX 2009-12-12 10:05:50

Answer 6

+1 A:

Here's my take on it, in bash. Features: skips non-regular files; correctly deals with files with weird characters (e.g. spaces) in their names; deals with extensionless filenames; skips already-hashed files, so it can be run repeatedly (although if files are modified between runs, it adds the new hash rather than replacing the old one). I wrote it using md5 -q as the hash function; you should be able to replace this with anything else, as long as it only outputs the hash, not something like filename => hash.

find -x . -type f -print0 | while IFS="" read -r -d $'\000' file; do
    hash="$(md5 -q "$file")" # replace with your favorite hash function
    [[ "$file" == *."$hash" ]] && continue # skip files that already end in their hash
    dirname="$(dirname "$file")"
    basename="$(basename "$file")"
    base="${basename%.*}"
    [[ "$base" == *."$hash" ]] && continue # skip files that already end in hash + extension
    if [[ "$basename" == "$base" ]]; then
            extension=""
    else
            extension=".${basename##*.}"
    fi
    mv "$file" "$dirname/$base.$hash$extension"
done

Gordon Davisson 2009-12-03 19:01:18

Answer 7

+2 A:

The logic of the requirements is complex enough to justify the use of Python instead of bash. It should provide a more readable, extensible, and maintainable solution.

#!/usr/bin/env python
import hashlib, os

def ishash(h, size):
    """Whether `h` looks like hash's hex digest."""
    if len(h) != size:
        return False
    try:
        int(h, 16) # whether h is a hex number
        return True
    except ValueError:
        return False

for root, dirs, files in os.walk("."):
    dirs[:] = [d for d in dirs if not d.startswith(".")] # skip hidden dirs
    for path in (os.path.join(root, f) for f in files if not f.startswith(".")):
        with open(path, "rb") as fobj: # binary mode, so the exact bytes are hashed
            suffix = hash_ = "." + hashlib.md5(fobj.read()).hexdigest()
        hashsize = len(hash_) - 1
        # extract old hash from the name; add/replace the hash if needed
        barepath, ext = os.path.splitext(path) # ext may be empty
        if not ishash(ext[1:], hashsize):
            suffix += ext # add original extension
            barepath, oldhash = os.path.splitext(barepath)
            if not ishash(oldhash[1:], hashsize):
                suffix = oldhash + suffix # preserve 2nd (not a hash) extension
        else: # ext looks like a hash
            oldhash = ext
        if hash_ != oldhash: # replace old hash by new one
            os.rename(path, barepath+suffix)

Here's a test directory tree. It contains:

files without extension inside directories with a dot in their name
filename which already has a hash in it (test on idempotency)
filename with two extensions
newlines in names

$ tree a
a
|-- b
|   `-- c.d
|       |-- f
|       |-- f.ext1.ext2
|       `-- g.d41d8cd98f00b204e9800998ecf8427e
|-- c.ext^Mnewline
|   `-- f
`-- f^Jnewline.ext1

7 directories, 5 files

Result

$ tree a
a
|-- b
|   `-- c.d
|       |-- f.0bee89b07a248e27c83fc3d5951213c1
|       |-- f.ext1.614dd0e977becb4c6f7fa99e64549b12.ext2
|       `-- g.d41d8cd98f00b204e9800998ecf8427e
|-- c.ext^Mnewline
|   `-- f.0bee89b07a248e27c83fc3d5951213c1
`-- f^Jnewline.b6fe8bb902ca1b80aaa632b776d77f83.ext1

7 directories, 5 files

The solution works correctly for all cases.

Whirlpool hash is not in Python's stdlib, but there are both pure Python and C extensions that support it e.g., python-mhash.

To install it:

$ sudo apt-get install python-mhash

To use it:

import mhash

print mhash.MHASH(mhash.MHASH_WHIRLPOOL, "text to hash here").hexdigest()

Output: cbdca4520cc5c131fc3a86109dd23fee2d7ff7be56636d398180178378944a4f41480b938608ae98da7eccbf39a4c79b83a8590c4cb1bace5bc638fc92b3e653

Invoking `whirlpooldeep` in Python

from subprocess import PIPE, STDOUT, Popen

def getoutput(cmd):
    return Popen(cmd, stdout=PIPE, stderr=STDOUT).communicate()[0]

hash_ = getoutput(["whirlpooldeep", "-q", path]).rstrip()

git can help with problems that require tracking a set of files by their hashes.

J.F. Sebastian 2009-12-03 20:27:54

Home again, home again, jiggidy-jig! Gooood Evening, J.F!

_ande_turner_ 2009-12-04 03:20:20

@_ande_turner_: 1. you can compile it from source http://labix.org/python-mhash 2. Use pure Python whirlpool.py http://www.bjrn.se/code/whirlpoolpy.txt `import whirlpool; print Whirlpool("text to hash").hexdigest()`

J.F. Sebastian 2009-12-06 00:52:10

3. Invoke `whirlpooldeep` from Python. I've added an example to the answer.

J.F. Sebastian 2009-12-06 04:24:35

Answer 8

+1 A:

Peter Cordes 2009-12-03 20:36:40

Answer 9

A:

Ruby:

#!/usr/bin/env ruby
require 'digest/md5'

Dir.glob('**/*') do |f|
  next unless File.file? f
  next if /\.md5sum-[0-9a-f]{32}/ =~ f
  md5sum = Digest::MD5.file f
  newname = "%s/%s.md5sum-%s%s" %
    [File.dirname(f), File.basename(f,'.*'), md5sum, File.extname(f)]
  File.rename f, newname
end

Handles filenames that have spaces, no extension, and that have already been hashed.

Ignores hidden files and directories — add File::FNM_DOTMATCH as the second argument of glob if that's desired.

jleedev 2009-12-03 21:17:37

Answer 10

+1 A:

Eirik Schwenke 2009-12-04 14:41:28

In hashname(), your echo command doesn't quote any of the variables, so they're subject to word-splitting, and then echo joins its args with a single space. So it won't work on filenames with repeated whitespace, or whitespace other than " ".

Peter Cordes 2009-12-06 07:52:08

and you're using bash-specific features (maybe just the array in mktest), so you need to say #!/bin/bash (or with env, as you're already doing). Nice job on including a test dataset, though.

Peter Cordes 2009-12-06 07:55:42

Eirik Schwenke 2009-12-09 18:17:39

hm, the reason this worked seems to be that the non-standard declare-syntax is simply ignored (with warning) by ksh.

Eirik Schwenke 2009-12-09 18:49:33

Answer 11

+3 A:

Peter Cordes 2009-12-06 07:42:50

I copied this into `~/whirlpool-rename.pl`, made it executable, moved into my test folder, ran it, and it returned 'invalid top directory at /System/Library/Perl/5.10.0/File/Find.pm line 593.' I don't know how to check if the `libperl-digest-whirlpool` is installed.

_ande_turner_ 2009-12-06 13:26:06

put in a use Digest::Whirlpool; line. It will fail at that point if you don't have the package. In which case, you'll need to install it, probably using CPAN, or fink, if fink packages it. (The cpan command works on OS X)

Peter Cordes 2009-12-06 18:55:15

As for the "invalid top directory at /System/.../File/Find.pm", that happens when you run it without any args. You're passing the empty list to find(). I forgot to make "." the default directory to recurse into. I'll edit that in.

Peter Cordes 2009-12-06 18:56:55

Ok, try the new version. I also changed the use Digest; to use Digest::Whirlpool, so it will fail if Digest doesn't have whirlpool. And I put in a comment about where you could link instead of rename.

Peter Cordes 2009-12-06 19:02:29

Hmm, you could cp -al first, and then run this on the hardlink farm. That loses the advantage of not writing to disk when nothing changes, though, and it's probably cleaner to just handle it in the perl script.

Peter Cordes 2009-12-06 19:03:57

You could obviously parse the output of a whirlpool program for each file, if you can't get CPAN working.

Peter Cordes 2009-12-08 20:09:50

+1: for error handling

J.F. Sebastian 2009-12-12 16:42:38

Answer 12

+1 A:

In response to your updated question:

If anyone can comment on how I can avoid looking in hidden directories with my BASH Script, it would be much appreciated.

You can avoid hidden directories with find by using

find -name '.?*' -prune -o \( -type f -print0 \)

-name '.*' -prune will prune ".", and stop without doing anything. :/
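
The effect of the '.?*' pattern can be checked on a throwaway tree. A small sketch; the helper name list_visible_files and the file names in the demo tree are made up for illustration:

```shell
#!/bin/bash
# Hypothetical helper: list regular files under "$1", skipping every entry
# whose name starts with a dot (hidden files and whole hidden directories).
# '.?*' needs at least one character after the dot, so "." itself is kept.
list_visible_files() {
    find "$1" -name '.?*' -prune -o -type f -print
}

# Build a small test tree, list it from inside, then clean up.
demo=$(mktemp -d)
mkdir -p "$demo/.git" "$demo/src"
touch "$demo/.git/config" "$demo/.hidden" "$demo/src/main.c" "$demo/README"
(cd "$demo" && list_visible_files . | LC_ALL=C sort)
rm -rf "$demo"
```

Only ./README and ./src/main.c are listed; .hidden and everything below .git are pruned, while the starting directory "." is still searched.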

I'd still recommend my Perl version, though. I updated it... You may still need to install Digest::Whirlpool from CPAN, though.

Peter Cordes 2009-12-08 20:05:59

See my answer: http://stackoverflow.com/questions/1841737/hashing-multiple-files/1880234#1880234

SiegeX 2009-12-10 20:54:52

Answer 13

+4 A:

Updated to fix:
1. File names with '[' or ']' in their name (really, any character now. See comment)
2. Handling of md5sum when hashing a file with a backslash or newline in its name
3. Functionized hash-checking algo for modularity
4. Refactored hash-checking logic to remove double-negatives

#!/bin/bash
if (($# != 1)) || ! [[ -d "$1" ]]; then
    echo "Usage: $0 /path/to/directory"
    exit 1
fi

is_hash() {
 md5=${1##*.} # strip prefix
 [[ "$md5" == *[^[:xdigit:]]* || ${#md5} -lt 32 ]] && echo "$1" || echo "${1%.*}"
}

while IFS= read -r -d $'\0' file; do
    read hash junk < <(md5sum "$file")
    basename="${file##*/}"
    dirname="${file%/*}"
    pre_ext="${basename%.*}"
    ext="${basename:${#pre_ext}}"

    # File already hashed?
    pre_ext=$(is_hash "$pre_ext")
    ext=$(is_hash "$ext")

    mv "$file" "${dirname}/${pre_ext}.${hash}${ext}" 2> /dev/null

done < <(find "$1" -path "*/.*" -prune -o \( -type f -print0 \))

This code has the following benefits over other entries thus far

It is fully compliant with Bash versions 2.0.2 and beyond
No superfluous calls to other binaries like sed or grep; uses builtin parameter expansion instead
Uses process substitution for 'find' instead of a pipe, so the while loop runs in the current shell rather than in a sub-shell
Takes the directory to work on as an argument and does a sanity check on it
Uses $() rather than `` notation for command substitution; the older backquote form is harder to nest and is generally discouraged
Works with files with spaces
Works with files with newlines
Works with files with multiple extensions
Works with files with no extension
Does not traverse hidden directories
Does NOT skip pre-hashed files, it will recalculate the hash as per the spec
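
The sub-shell point matters as soon as the loop has to hand something back to the rest of the script. A minimal sketch of the difference, assuming bash with its default options (no `shopt -s lastpipe`); the function names are made up for illustration:

```shell
#!/bin/bash
# Piped loop: the loop body runs in a sub-shell, so the increments to n
# are lost when the pipeline ends.
count_with_pipe() {
    local n=0 line
    printf 'a\nb\nc\n' | while IFS= read -r line; do n=$((n + 1)); done
    echo "$n"
}

# Loop fed by process substitution: it runs in the current shell, so n
# keeps its value after the loop.
count_with_procsub() {
    local n=0 line
    while IFS= read -r line; do n=$((n + 1)); done < <(printf 'a\nb\nc\n')
    echo "$n"
}

count_with_pipe      # prints 0
count_with_procsub   # prints 3
```

The script above does not keep any state across iterations, so for it the choice is mostly about avoiding an extra process, but the same pattern is what makes counters or error flags set inside the loop visible afterwards.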

Test Tree

$ tree -a a
a
|-- .hidden_dir
|   `-- foo
|-- b
|   `-- c.d
|       |-- f
|       |-- g.5236b1ab46088005ed3554940390c8a7.ext
|       |-- h.d41d8cd98f00b204e9800998ecf8427e
|       |-- i.ext1.5236b1ab46088005ed3554940390c8a7.ext2
|       `-- j.ext1.ext2
|-- c.ext^Mnewline
|   |-- f
|   `-- g.with[or].ext
`-- f^Jnewline.ext

4 directories, 9 files

Result

$ tree -a a
a
|-- .hidden_dir
|   `-- foo
|-- b
|   `-- c.d
|       |-- f.d41d8cd98f00b204e9800998ecf8427e
|       |-- g.d41d8cd98f00b204e9800998ecf8427e.ext
|       |-- h.d41d8cd98f00b204e9800998ecf8427e
|       |-- i.ext1.d41d8cd98f00b204e9800998ecf8427e.ext2
|       `-- j.ext1.d41d8cd98f00b204e9800998ecf8427e.ext2
|-- c.ext^Mnewline
|   |-- f.d41d8cd98f00b204e9800998ecf8427e
|   `-- g.with[or].d41d8cd98f00b204e9800998ecf8427e.ext
`-- f^Jnewline.d3b07384d113edec49eaa6238ad5ff00.ext

4 directories, 9 files

SiegeX 2009-12-10 11:01:08

You only need the -d check because you need the path to include a /, right? Otherwise you could find "$@". If there were bare filenames as part of that, though, you'd get dirname="$file".

Peter Cordes 2009-12-10 21:12:20

I had 3 files with "[" or "]" in their filenames, and things went awry upon rehashing.

_ande_turner_ 2009-12-10 23:31:56

@_ande_turner_: Thanks for catching that. Give it another shot, I think you'll be pleasantly surprised. I now use the offset parameter expansion to determine the extension, so the contents of the file name should no longer interfere. I also fixed a problem with 'md5sum' putting a '\' in the hash if the file has a newline or a backslash. This should now work with whatever you throw at it.

SiegeX 2009-12-11 01:28:30

@Peter: No, I use the -d because if the user does not pass a directory as the argument, it is probably a mistake on their part. If this behavior is not desired, and you actually do want to run it on just one file, then you can safely remove '! [[ -d "$1" ]]' from the sanity check.

SiegeX 2009-12-11 01:46:02

Very nice solution. I learned a lot of Bash tricks from reading this. Adding to bookmarks!

Mark Byers 2009-12-13 21:49:58

Hashing Multiple Files

Problem Specification:

Question:

a) How would you do this?

b) Out of the all methods available to you, what makes your method most suitable?

Verdict:

