Hi all, I have a tar file that contains a number of files. I need to write a Python script that reads the contents of those files and gives the total character count, including letters, spaces, newline characters, everything, without untarring the tar file.

A: 
tar tvf file.tar

will list the contents

jspcal
The OP wants to write a Python script.
Amit
You can use os.system to run a command. It's up to the OP whether he wants to use that or a module.
jspcal
@jspcal. While it's entirely up to the OP, it's still better not to call system commands unnecessarily: first, for the portability of your code; second, Python already has a module for it.
ghostdog74
It's up to the OP's judgement to use the right tool for the job. The Python module doesn't support certain tar features, for instance (which the vast majority of cases won't need). Or, in a case of frequently grepping for a path regexp in a 2TB .tbz file (a fairly common type of backup search), the shell might be better suited. In many cases the tarfile module is a much superior option, but it is not necessarily a total replacement for other methods.
jspcal
It lists only the top-level contents. Is there any option to list the contents recursively?
Naga Kiran
+2  A: 

You need to use the tarfile module. Specifically, you use an instance of the class TarFile to access the file, and then access the names with TarFile.getnames().

 |  getnames(self)
 |      Return the members of the archive as a list of their names. It has
 |      the same order as the list returned by getmembers().

If instead you want to read the content, then you use this method

 |  extractfile(self, member)
 |      Extract a member from the archive as a file object. `member' may be
 |      a filename or a TarInfo object. If `member' is a regular file, a
 |      file-like object is returned. If `member' is a link, a file-like
 |      object is constructed from the link's target. If `member' is none of
 |      the above, None is returned.
 |      The file-like object is read-only and provides the following
 |      methods: read(), readline(), readlines(), seek() and tell()
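
A minimal sketch of how the two methods above might be combined, assuming Python 3. The helper name `member_sizes` is hypothetical, and the sizes are counted in bytes because extractfile() returns a binary file object.

```python
import tarfile


def member_sizes(tar_path):
    """Return {member name: size in bytes} for the regular files in a tar archive.

    Nothing is written to disk; each member is read through extractfile().
    """
    sizes = {}
    with tarfile.open(tar_path) as tar:
        for name in tar.getnames():
            member = tar.getmember(name)
            if not member.isfile():
                # Skip directories, links and other non-regular members.
                continue
            f = tar.extractfile(member)
            sizes[name] = len(f.read())
    return sizes
```

Calling `sum(member_sizes("file.tar").values())` would then give one total for the whole archive.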
Stefano Borini
+1  A: 

You can use getmembers():

>>> import tarfile
>>> tar = tarfile.open("test.tar")
>>> tar.getmembers()

After that, you can use extractfile() to get each member as a file object, without unpacking anything to disk. Just an example:

import os
import tarfile

os.chdir("/tmp/foo")
tar = tarfile.open("test.tar")
for member in tar.getmembers():
    f = tar.extractfile(member)
    if f is None:
        # directories and other non-file members have no content
        continue
    content = f.read()  # bytes, read straight from the archive
    print("%s has %d newlines" % (member.name, content.count(b"\n")))
    print("%s has %d spaces" % (member.name, content.count(b" ")))
    print("%s has %d characters" % (member.name, len(content)))
tar.close()

With the file object "f" in the above example, you can use read(), readlines() etc.
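
For the counts the question asks about (letters, spaces, newlines, everything), here is a rough sketch that reads each member line by line instead of loading it whole. It assumes Python 3 and that the files are UTF-8 text; the function name `count_text` and the `encoding` parameter are made up for illustration.

```python
import io
import tarfile


def count_text(tar_path, encoding="utf-8"):
    """Total characters, letters, spaces and newlines over all regular files."""
    totals = {"characters": 0, "letters": 0, "spaces": 0, "newlines": 0}
    with tarfile.open(tar_path) as tar:
        for member in tar:  # members are visited in archive order
            if not member.isfile():
                continue
            raw = tar.extractfile(member)
            # newline="" disables newline translation, so "\r\n" stays two characters.
            # Assumption: undecodable bytes are replaced rather than raising an error.
            text = io.TextIOWrapper(raw, encoding=encoding, errors="replace", newline="")
            for line in text:
                totals["characters"] += len(line)
                totals["letters"] += sum(ch.isalpha() for ch in line)
                totals["spaces"] += line.count(" ")
                totals["newlines"] += line.count("\n")
    return totals
```

Decoding matters only if "characters" should mean text characters rather than bytes; for plain ASCII files the two counts are the same.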

ghostdog74
thanks a lot ghostdog74. this is what i was looking for.
randeepsp