I have a bunch of HTML files in an HTML folder. Those files contain unicode characters, which I strip out using filter(lambda x: x in string.printable, line). Now how do I write the changes back to the original file? What is the best way of doing it? Each HTML file is about 30 KB in size.

import os, string

for file in os.listdir("HTML/"):
    print file
    myfile = open('HTML/' + file)
    fileList = myfile.readlines()
    for line in fileList:
        #print line
        line = filter(lambda x: x in string.printable, line)
    myfile.close()
+2  A: 

At first I didn't understand what @~unutbu was getting at, but after reading the documentation for the fileinput module I found this, which I hadn't seen before (emphasis mine):

Optional in-place filtering: if the keyword argument inplace=1 is passed to fileinput.input() or to the FileInput constructor, the file is moved to a backup file and standard output is directed to the input file (if a file of the same name as the backup file already exists, it will be replaced silently). This makes it possible to write a filter that rewrites its input file in place. If the backup parameter is given (typically as backup='.<some extension>'), it specifies the extension for the backup file, and the backup file remains around; by default, the extension is '.bak' and it is deleted when the output file is closed. In-place filtering is disabled when standard input is read.

So I think his answer is best, and this explains why.
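
To make the quoted behaviour concrete, here is a minimal sketch in Python 3 syntax (the thread itself uses Python 2). The function name strip_nonprintable_inplace, the '.orig' backup extension, and the throwaway demo file are all made up for illustration; only fileinput.input() and its inplace and backup arguments come from the documentation quoted above.

```python
import fileinput
import os
import string
import tempfile


def strip_nonprintable_inplace(paths, backup_ext=".orig"):
    """Rewrite each file in place, keeping only string.printable characters.

    backup_ext is a hypothetical choice. Because a backup extension is
    given, fileinput leaves "<name><backup_ext>" next to each file.
    """
    printable = set(string.printable)
    with fileinput.input(files=paths, inplace=True, backup=backup_ext) as stream:
        for line in stream:
            # While inplace is active, print() writes into the current file.
            print("".join(ch for ch in line if ch in printable), end="")


# Demo on a throwaway file so nothing real is touched.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "page.html")
with open(demo_path, "w", encoding="utf-8") as f:
    f.write("<p>caf\u00e9</p>\n<p>plain</p>\n")

strip_nonprintable_inplace([demo_path])

with open(demo_path, encoding="utf-8") as f:
    cleaned = f.read()
with open(demo_path + ".orig", encoding="utf-8") as f:
    original = f.read()
```

Passing a backup extension is optional; it is used here only so that the untouched original stays on disk while the filter is being tried out.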

danben
Keep appending the lines? I don't want to lose any formatting. How do I do it? Thank you...
ThinkCode
Thanks for adding explanation
unutbu
+3  A: 

Use the fileinput module. It allows you to read and write to the same file in place:

import fileinput, os, string, sys

files = [os.path.join('HTML', filename) for filename in os.listdir("HTML/")]
for line in fileinput.input(files, inplace=True):
    # with inplace=True, whatever is written to sys.stdout goes into the file
    line = filter(lambda x: x in string.printable, line)
    sys.stdout.write(line)
unutbu
fileinput redirects sys.stdout to the file being operated upon.
unutbu
I tried this code (also have to import string, no biggie). All I have is a blank htm file at the end. What am I doing wrong?
ThinkCode
My bad, this code works. Sorry for the confusion. Thanks so much!
ThinkCode
A: 

This reads each file, filters it in memory, and writes it back. It only uses the standard "w" mode, so it should work on any platform (see note 2).

import os, string

for file in os.listdir("HTML/"):
    print file
    myfile = open('HTML/' + file)
    fileList = myfile.readlines()
    myfile.close()
    for pos, line in enumerate(fileList):
        line = filter(lambda x: x in string.printable, line) # see note 1
        fileList[pos] = line
    myfile = open('HTML/' + file, "w") # see note 2
    myfile.write("".join(fileList))    # lines already end in "\n"
    myfile.close()

Note 1. Simply assigning to line does not change fileList. Variables really are labels (references) onto objects: assigning to a label changes which object the label is attached to, not the object itself. The filter call creates a new string, which then has to be stored back into fileList[pos].

Note 2. The "w" file mode empties the file on opening (it is the equivalent of passing the O_TRUNC flag to open() ). It is a standard mode, so no platform-specific flag is needed.
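
Both notes can be checked with a short sketch. It is written in Python 3 syntax and uses made-up sample data and a throwaway temporary file, so treat it as an illustration rather than part of the answer's code.

```python
import os
import tempfile

lines = ["a\u00e9\n", "b\n"]

# Rebinding the loop variable: the list still holds the old strings.
for line in lines:
    line = line.replace("\u00e9", "")
unchanged = list(lines)

# Assigning through the index stores the new string in the list.
for pos, line in enumerate(lines):
    lines[pos] = line.replace("\u00e9", "")

# Opening with "w" truncates: the old content is gone before any write.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open(path, "w") as f:
    f.write("old content that is longer\n")
with open(path, "w") as f:
    f.writelines(lines)  # lines keep their own "\n", so no separator is added
with open(path) as f:
    result = f.read()
```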

badp