views:

1701

answers:

5

This is a question regarding Unix shell scripting (any shell), but any other "standard" scripting language solution would also be appreciated:

I have a directory full of files where the filenames are hash values like this:

fd73d0cf8ee68073dce270cf7e770b97
fec8047a9186fdcc98fdbfc0ea6075ee

These files have different original file types such as png, zip, doc, pdf etc.

Can anybody provide a script that would rename the files so they get their appropriate file extension, probably based on the output of the file command?

Answer:

J.F. Sebastian's script will work for both ouput of the filenames as well as the actual renaming.

+12  A: 

You can use

file -i filename

to get a MIME-type. You could potentially lookup the type in a list and then append an extension. You can find list of MIME-types and suggested file extensions on the net.

csl
That's what I'd do, but be careful---very careful---with the various text-like extensions (.txt, .c, ... ) as file uses some fairly loose heuristics to guess at these.
dmckee
+7  A: 

Following csl's response:

You can use

file -i filename

to get a MIME-type. You could potentially lookup the type in a list and then append an extension. You can find list of MIME-types and suggested file extensions on the net.

I'd suggest you write a script that takes the output of file -i filename, and returns an extension (split on spaces, find the '/', look up that term in a table file) in your language of choice - a few lines at most. Then you can do something like:

for f in *; do mv $f $f.`file -i $f | get_extension.py`; done

in bash, or throw that in a bash script. Or make the get_extension script bigger, but that makes it less useful next time you want the relevant extension.

2 Make available as open source.

3 Profit!

Phil H
+6  A: 

Here's mimetypes' version:

#!/usr/bin/env python
"""It is a `filename -> filename.ext` filter. 

   `ext` is mime-based.

"""
import fileinput
import mimetypes
import os
import sys
from subprocess import Popen, PIPE

if len(sys.argv) > 1 and sys.argv[1] == '--rename':
    do_rename = True
    del sys.argv[1]
else:
    do_rename = False    

for filename in (line.rstrip() for line in fileinput.input()):
    output, _ = Popen(['file', '-bi', filename], stdout=PIPE).communicate()
    mime = output.split(';', 1)[0].lower().strip()
    ext = mimetypes.guess_extension(mime, strict=False)
    if ext is None:
        ext = os.path.extsep + 'undefined'
    filename_ext = filename + ext
    print filename_ext
    if do_rename:
       os.rename(filename, filename_ext)

Example:

$ ls *.file? | python add-ext.py --rename
avi.file.avi
djvu.file.undefined
doc.file.dot
gif.file.gif
html.file.html
ico.file.obj
jpg.file.jpe
m3u.file.ksh
mp3.file.mp3
mpg.file.m1v
pdf.file.pdf
pdf.file2.pdf
pdf.file3.pdf
png.file.png
tar.bz2.file.undefined


Following @Phil H's response that follows @csl' response:

#!/usr/bin/env python
"""It is a `filename -> filename.ext` filter. 

   `ext` is mime-based.
"""
# Mapping of mime-types to extensions is taken form here:
# http://as3corelib.googlecode.com/svn/trunk/src/com/adobe/net/MimeTypeMap.as
mime2exts_list = [
    ["application/andrew-inset","ez"],
    ["application/atom+xml","atom"],
    ["application/mac-binhex40","hqx"],
    ["application/mac-compactpro","cpt"],
    ["application/mathml+xml","mathml"],
    ["application/msword","doc"],
    ["application/octet-stream","bin","dms","lha","lzh","exe","class","so","dll","dmg"],
    ["application/oda","oda"],
    ["application/ogg","ogg"],
    ["application/pdf","pdf"],
    ["application/postscript","ai","eps","ps"],
    ["application/rdf+xml","rdf"],
    ["application/smil","smi","smil"],
    ["application/srgs","gram"],
    ["application/srgs+xml","grxml"],
    ["application/vnd.adobe.apollo-application-installer-package+zip","air"],
    ["application/vnd.mif","mif"],
    ["application/vnd.mozilla.xul+xml","xul"],
    ["application/vnd.ms-excel","xls"],
    ["application/vnd.ms-powerpoint","ppt"],
    ["application/vnd.rn-realmedia","rm"],
    ["application/vnd.wap.wbxml","wbxml"],
    ["application/vnd.wap.wmlc","wmlc"],
    ["application/vnd.wap.wmlscriptc","wmlsc"],
    ["application/voicexml+xml","vxml"],
    ["application/x-bcpio","bcpio"],
    ["application/x-cdlink","vcd"],
    ["application/x-chess-pgn","pgn"],
    ["application/x-cpio","cpio"],
    ["application/x-csh","csh"],
    ["application/x-director","dcr","dir","dxr"],
    ["application/x-dvi","dvi"],
    ["application/x-futuresplash","spl"],
    ["application/x-gtar","gtar"],
    ["application/x-hdf","hdf"],
    ["application/x-javascript","js"],
    ["application/x-koan","skp","skd","skt","skm"],
    ["application/x-latex","latex"],
    ["application/x-netcdf","nc","cdf"],
    ["application/x-sh","sh"],
    ["application/x-shar","shar"],
    ["application/x-shockwave-flash","swf"],
    ["application/x-stuffit","sit"],
    ["application/x-sv4cpio","sv4cpio"],
    ["application/x-sv4crc","sv4crc"],
    ["application/x-tar","tar"],
    ["application/x-tcl","tcl"],
    ["application/x-tex","tex"],
    ["application/x-texinfo","texinfo","texi"],
    ["application/x-troff","t","tr","roff"],
    ["application/x-troff-man","man"],
    ["application/x-troff-me","me"],
    ["application/x-troff-ms","ms"],
    ["application/x-ustar","ustar"],
    ["application/x-wais-source","src"],
    ["application/xhtml+xml","xhtml","xht"],
    ["application/xml","xml","xsl"],
    ["application/xml-dtd","dtd"],
    ["application/xslt+xml","xslt"],
    ["application/zip","zip"],
    ["audio/basic","au","snd"],
    ["audio/midi","mid","midi","kar"],
    ["audio/mpeg","mp3","mpga","mp2"],
    ["audio/x-aiff","aif","aiff","aifc"],
    ["audio/x-mpegurl","m3u"],
    ["audio/x-pn-realaudio","ram","ra"],
    ["audio/x-wav","wav"],
    ["chemical/x-pdb","pdb"],
    ["chemical/x-xyz","xyz"],
    ["image/bmp","bmp"],
    ["image/cgm","cgm"],
    ["image/gif","gif"],
    ["image/ief","ief"],
    ["image/jpeg","jpg","jpeg","jpe"],
    ["image/png","png"],
    ["image/svg+xml","svg"],
    ["image/tiff","tiff","tif"],
    ["image/vnd.djvu","djvu","djv"],
    ["image/vnd.wap.wbmp","wbmp"],
    ["image/x-cmu-raster","ras"],
    ["image/x-icon","ico"],
    ["image/x-portable-anymap","pnm"],
    ["image/x-portable-bitmap","pbm"],
    ["image/x-portable-graymap","pgm"],
    ["image/x-portable-pixmap","ppm"],
    ["image/x-rgb","rgb"],
    ["image/x-xbitmap","xbm"],
    ["image/x-xpixmap","xpm"],
    ["image/x-xwindowdump","xwd"],
    ["model/iges","igs","iges"],
    ["model/mesh","msh","mesh","silo"],
    ["model/vrml","wrl","vrml"],
    ["text/calendar","ics","ifb"],
    ["text/css","css"],
    ["text/html","html","htm"],
    ["text/plain","txt","asc"],
    ["text/richtext","rtx"],
    ["text/rtf","rtf"],
    ["text/sgml","sgml","sgm"],
    ["text/tab-separated-values","tsv"],
    ["text/vnd.wap.wml","wml"],
    ["text/vnd.wap.wmlscript","wmls"],
    ["text/x-setext","etx"],
    ["video/mpeg","mpg","mpeg","mpe"],
    ["video/quicktime","mov","qt"],
    ["video/vnd.mpegurl","m4u","mxu"],
    ["video/x-flv","flv"],
    ["video/x-msvideo","avi"],
    ["video/x-sgi-movie","movie"],
    ["x-conference/x-cooltalk","ice"]]

#NOTE: take only the first extension
mime2ext = dict(x[:2] for x in mime2exts_list)

if __name__ == '__main__':
    import fileinput, os.path
    from subprocess import Popen, PIPE

    for filename in (line.rstrip() for line in fileinput.input()):
        output, _ = Popen(['file', '-bi', filename], stdout=PIPE).communicate()
        mime = output.split(';', 1)[0].lower().strip()
        print filename + os.path.extsep + mime2ext.get(mime, 'undefined')


Here's a snippet for old python's versions (not tested):

#NOTE: take only the first extension
mime2ext = {}
for x in mime2exts_list:
    mime2ext[x[0]] = x[1]

if __name__ == '__main__':
    import os
    import sys

    # this version supports only stdin (part of fileinput.input() functionality)
    lines = sys.stdin.read().split('\n')
    for line in lines:
        filename = line.rstrip()
        output = os.popen('file -bi ' + filename).read()        
        mime = output.split(';')[0].lower().strip()
        try: ext = mime2ext[mime]
        except KeyError:
             ext = 'undefined'
        print filename + '.' + ext

It should work on Python 2.3.5 (I guess).

J.F. Sebastian
When I try to run your script I get the following error: File "./get_extension.py", line 121 mime2ext = dict(x[:2] for x in mime2exts_list) ^SyntaxError: invalid syntaxThis is using Python 2.3.5Any idea why that is?
Palmin
Hmm... comments don't allow formatting, but I hope you get the idea. The error ^ is below the "for" statement
Palmin
It's a generator expression. It doesn't work on Python 2.3.5. I've tested it on Python 2.5.2. I will add snippets for Python 2.3.5
J.F. Sebastian
I've added the snippet for old Python version. It is not tested.
J.F. Sebastian
The second snippet almost works for Python 2.3.5, I had to add .rstrip() to the line "mime = output.split(';')[0].lower().rstrip()" because the mime-type has some trailing whitespace.
Palmin
Don't do this. Python has a builtin mimetypes module for this stuff. http://stackoverflow.com/questions/352837/how-to-add-file-extensions-based-on-file-type-on-linuxunix#354036
Jason Baker
I've added mimetypes' version
J.F. Sebastian
Thanks for your script, works great. You might want to edit your answer to include the bash-line that can be used to really rename the files.
Palmin
I've added '--rename' parameter to actually do renaming.
J.F. Sebastian
Deleted my post since this one pretty much covers it. :-)
Jason Baker
+2  A: 

Of course, it should be added that deciding on a MIME type just based on file(1) output can be very inaccurate/vague (what's "data" ?) or even completely incorrect...

Keltia
A: 

Agreeing with Keltia, and elaborating some on his answer:

Take care -- some filetypes may be problematic. JPEG2000, for example.
And others might return too much info given the "file" command without any option tags. The way to avoid this is to use "file -b" for a brief return of information.

BZT

silversleevesx