I need to copy files from a set of CDs that share a lot of duplicate content, both with each other and with what's already on my hard disk. Identical files do not have the same names, and they sit in sub-directories with different names. I want to copy only the non-duplicate files from each CD into a new directory on the hard disk. I don't care about the sub-directories (I will sort that out later); I just want the unique files.

I can't find software to do that - see my post at SuperUser http://superuser.com/questions/129944/software-to-copy-non-duplicate-files-from-cd-dvd

Someone at SuperUser suggested I write a script using GNU's "find" and the Win32 version of some checksum tools. I glanced at that, but I have not done anything like it before, so I'm hoping something already exists that I can modify.

I found a good program to delete duplicates, Duplicate Cleaner (it compares checksums), but it won't help me here: I'd have to copy all the CDs to disk first, each is probably about 80% duplicates, and I don't have room for that. I'd have to cycle through a few CDs at a time, copying everything and then turning around and deleting 80% of it, which works the hard drive a lot.

Thanks for any help.

A: 

I don't use Windows, but I'll give a suggestion: a combination of GNU find and a Lua script. For find you can try

find / -type f -exec md5sum '{}' ';'

If your GNU software includes xargs the following will be equivalent but may be significantly faster:

find / -type f -print0 | xargs -0 md5sum

This will give you a list of checksums and corresponding filenames. We'll throw away the filenames and keep the checksums:

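#!/usr/bin/env lua

-- Set of checksums seen so far (hard disk first, then each copied CD file).
local checksums = {}

-- Each md5sum line is "<checksum> <pathname>". Some builds put '*' before
-- the pathname (binary mode), so skip it if present.
local pattern = '^(%S+)%s+%*?(.*)$'

-- Standard input: md5sum output for the files already on the hard disk.
for l in io.lines() do
  local checksum = l:match(pattern)
  if checksum then checksums[checksum] = true end  -- ignore blank lines
end

-- Checksum every regular file on the CD (assumed here to be drive e:).
local cdfiles = assert(io.popen('find e:/ -type f -print0 | xargs -0 md5sum'))

for l in cdfiles:lines() do
  local checksum, pathname = l:match(pattern)
  if checksum and not checksums[checksum] then
    io.stderr:write('copying file ', pathname, '\n')
    -- NOTE: pathname is not quoted, so names containing spaces or other
    -- special characters will break this command.
    os.execute('cp ' .. pathname .. ' c:/files/from/cd')
    checksums[checksum] = true
  end
end
cdfiles:close()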

You can then pipe the output from

find / -type f -print0 | xargs -0 md5sum

into this script.

There are a few problems:

  • If the filename has special characters, it will need to be quoted. I don't know the quoting conventions on Windows.

  • It would be more efficient to write the checksums to disk rather than rerun find over the hard disk for every CD. You could try

    local csums = assert(io.open('/tmp/checksums', 'w'))
    for cs in pairs(checksums) do csums:write(cs, '\n') end
    csums:close()
    

    And then read checksums back in from the file using io.lines again.
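
To make the two points above a bit more concrete, here is a minimal sketch, not a tested solution. The helper names (quote_arg, save_checksums, load_checksums) are my own inventions for illustration. The quoting helper assumes a Windows-style shell where wrapping an argument in double quotes is enough for names containing spaces; it does not cover every special character.

```lua
-- Hypothetical helpers; none of these names come from the script above.

-- Wrap a pathname in double quotes so the shell sees it as one argument.
-- Assumption: Windows filenames cannot contain '"', so we refuse such names
-- instead of guessing how to escape them. This is not complete protection:
-- cmd.exe, for example, still expands %VAR% inside double quotes.
local function quote_arg(path)
  assert(not path:find('"', 1, true), 'pathname contains a double quote')
  return '"' .. path .. '"'
end

-- Write every known checksum to a file, one per line.
local function save_checksums(checksums, filename)
  local f = assert(io.open(filename, 'w'))
  for cs in pairs(checksums) do f:write(cs, '\n') end
  f:close()
end

-- Read checksums back into a set. A missing file yields an empty set,
-- so the first run works without any special case.
local function load_checksums(filename)
  local checksums = {}
  local f = io.open(filename, 'r')
  if f then
    for line in f:lines() do
      if line ~= '' then checksums[line] = true end
    end
    f:close()
  end
  return checksums
end
```

With something like this, the script could start with `local checksums = load_checksums(cachefile)` in place of reading find's output from standard input, build the copy command as `'cp ' .. quote_arg(pathname) .. ' c:/files/from/cd'`, and finish with `save_checksums(checksums, cachefile)` after each CD, where `cachefile` is whatever location you pick. You would still need one initial pass over the hard disk to fill the cache.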

I hope this is enough to get you started. You can download Lua from http://lua.org, and I recommend the superb book Programming in Lua (check out the previous edition free online).

Norman Ramsey