I have recovered some Word documents from a corrupted hard drive using a piece of software called photorec. The problem is that the documents' names can't be recovered; they are all renamed by a sequence of numbers. There are over 2000 documents to sort through and I was wondering if I could rename them using some automated process.

Is there a script I could use to find the first 10 letters in the document and rename it with that? It would have to be able to cope with multiple documents having the same first 10 letters and so not write over documents with the same name. Also, it would have to avoid renaming the document with illegal characters (such as '?', '*', '/', etc.)

I only have a little experience with Python and C, and even less with bash scripting on Linux, so bear with me if I don't know exactly what I'm doing if I have to write a new script.

+2  A: 

Word documents are stored in a custom format which places a load of binary cruft on the beginning of the file.

The simplest thing would be to knock something up in Python that searches for the first long run of ASCII chars in the file. Here you go:

#!/usr/bin/env python3

import glob
import os

for file in glob.glob("*.doc"):
    new_name = ""
    chars = 0
    found = False

    # Read byte by byte, looking for the first run of 100 ASCII chars.
    with open(file, "rb") as f:
        byte = f.read(1)
        while byte:
            code = byte[0]
            if 0 < code < 128:
                char = chr(code)
                # Keep a-z, A-Z and 0-9; replace anything else with "_".
                if char.isalnum():
                    new_name += char
                else:
                    new_name += "_"
                chars += 1
                if chars == 100:
                    new_name = new_name[:20] + ".doc"
                    found = True
                    break
            else:
                # Non-ASCII byte: the run is broken, so start again.
                new_name = ""
                chars = 0
            byte = f.read(1)

    # Rename only once the file is closed, and only if a full run was found.
    if found:
        print("renaming " + file + " to " + new_name)
        os.rename(file, new_name)

NOTE: if you want to glob multiple directories you'll need to change the glob line accordingly. Also this takes no account of whether the file you're trying to rename to already exists, so if you have multiple docs with the same first few chars then you'll need to handle that.
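Both points in the note can be handled with a couple of small helpers. This is only a sketch, assuming Python 3.5 or later; find_docs and unique_path are names made up for illustration, and the "_1", "_2" suffix scheme is just one possible choice.

```python
import glob
import os


def find_docs(root):
    """Return every .doc file under root, including subdirectories."""
    # "**" with recursive=True matches zero or more directory levels.
    pattern = os.path.join(root, "**", "*.doc")
    return sorted(glob.glob(pattern, recursive=True))


def unique_path(directory, stem, ext=".doc"):
    """Return a path inside directory that does not exist yet.

    Tries stem.doc first, then stem_1.doc, stem_2.doc, and so on.
    """
    candidate = os.path.join(directory, stem + ext)
    counter = 1
    while os.path.exists(candidate):
        candidate = os.path.join(directory, "%s_%d%s" % (stem, counter, ext))
        counter += 1
    return candidate
```

To use it, loop over find_docs(".") instead of the glob call, and rename each file to unique_path(os.path.dirname(file), stem), where stem is the 20-character name without the extension.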

The script finds the first chunk of 100 ASCII chars in a row (if you look for fewer than that you end up picking up doc keywords and such) and then uses the first 20 of these to make the new name, replacing anything that's not a-z, A-Z or 0-9 with underscores to avoid file name issues.
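The same search can be written more compactly with a regular expression over the raw bytes. This is a sketch under a few assumptions: Python 3, the whole file fits in memory, and name_from_bytes is a hypothetical helper name. It returns None when no run of 100 ASCII bytes exists.

```python
import re

# 100 consecutive bytes in the range 0x01-0x7F (ASCII, excluding NUL).
ASCII_RUN = re.compile(rb"[\x01-\x7f]{100}")
# Anything that is not a-z, A-Z or 0-9.
UNSAFE = re.compile(rb"[^A-Za-z0-9]")


def name_from_bytes(data, keep=20):
    """Build a safe file name stem from the first long ASCII run in data."""
    match = ASCII_RUN.search(data)
    if match is None:
        return None
    head = match.group(0)[:keep]
    # Replace unsafe characters with underscores, as described above.
    return UNSAFE.sub(b"_", head).decode("ascii")
```

Calling it would look like name_from_bytes(open(file, "rb").read()), with ".doc" appended to the result.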

Vicky
+3  A: 

How about VBScript? Here is a sketch:

FolderName = "C:\Docs\"

Set fs = CreateObject("Scripting.FileSystemObject")

Set fldr = fs.GetFolder(Foldername)

Set ws = CreateObject("Word.Application")

For Each f In fldr.Files
    '' Skip Word's temporary lock files (~$...)
    If Left(f.Name, 2) <> "~$" Then
        If InStr(f.Type, "Microsoft Word") Then

            '' Debug prompt; remove it before running over 2000 files
            MsgBox f.Name

            Set doc = ws.Documents.Open(FolderName & f.Name)
            s = vbNullString
            i = 1
            '' Use the first paragraph that still has text after cleaning
            Do While Trim(s) = vbNullString And i <= doc.Paragraphs.Count
                s = doc.Paragraphs(i).Range.Text
                s = CleanString(Left(s, 10))
                i = i + 1
            Loop

            doc.Close False

            If s = "" Then s = "NoParas"

            '' Extension including the dot, e.g. ".doc"
            ext = Mid(f.Name, InStrRev(f.Name, "."))

            '' Append a counter until the full target path is unused
            s1 = s
            i = 1
            Do While fs.FileExists(FolderName & s1 & ext)
                s1 = s & i
                i = i + 1
            Loop

            MsgBox "Name " & FolderName & f.Name & " As " & FolderName & s1 & ext

            '' This uses copy, because it seems safer
            f.Copy FolderName & s1 & ext, False

            '' MoveFile would rename the file instead of copying it:
            '' fs.MoveFile FolderName & f.Name, FolderName & s1 & ext

        End If
    End If
Next

msgbox "Done"
ws.Quit
Set ws = Nothing
Set fs = Nothing

Function CleanString(StringToClean)
''http://msdn.microsoft.com/en-us/library/ms974570.aspx
Dim objRegEx 
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.IgnoreCase = True
objRegEx.Global = True

''Find anything not a-z, 0-9
objRegEx.Pattern = "[^a-z0-9]"

CleanString = objRegEx.Replace(StringToClean, "")
End Function
Remou
Sorry it's taken me so long to reply. How do I execute this script? Do I save it as a .bat file or something?
Eddy
Save it with a .vbs extension for VBScript, then double-click it or run it with cscript from a command prompt. It is possible that VBScript has been disabled for policy reasons: http://technet.microsoft.com/en-us/library/ee198684.aspx
Remou