I almost always do file processing using generators. This makes for code that's fast, easy to modify, and easy to test.
First, build a generator that removes duplicates:
def remove_duplicates(seq):
    found = set()
    for item in seq:
        if item in found:
            continue
        found.add(item)
        yield item
Does it work?
>>> print("\n".join(remove_duplicates(["aa", "bb", "cc", "aa"])))
aa
bb
cc
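Because remove_duplicates is a generator, it never builds the whole result in memory; it can even consume an endless stream. A quick sketch of that (endless and the use of itertools.islice are mine, just for illustration; the generator is repeated so the snippet runs on its own):

```python
from itertools import islice

def remove_duplicates(seq):
    found = set()
    for item in seq:
        if item in found:
            continue
        found.add(item)
        yield item

def endless():
    # yields "a", "b", "a", "b", ... forever
    while True:
        yield "a"
        yield "b"

# islice stops pulling items after the second one,
# so the infinite input is never a problem
print(list(islice(remove_duplicates(endless()), 2)))  # ['a', 'b']
```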
Apparently so. Next, create a function that tells you whether or not a line is OK:
def is_line_ok(line):
    if "bad text1" in line:
        return False
    if "bad text2" in line:
        return False
    return True
Does this work?
>>> is_line_ok("this line contains bad text2.")
False
>>> is_line_ok("this line's ok.")
True
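One way to keep is_line_ok easy to modify as the list of bad substrings grows is to drive it from a single table. A possible variant (BAD_SUBSTRINGS is my name, not something defined above):

```python
BAD_SUBSTRINGS = ("bad text1", "bad text2")

def is_line_ok(line):
    # reject the line if it contains any known bad substring
    return not any(bad in line for bad in BAD_SUBSTRINGS)

print(is_line_ok("this line contains bad text2."))  # False
print(is_line_ok("this line's ok."))  # True
```

Adding a new rule is now a one-line change to the tuple rather than a new if-block.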
So now we can use remove_duplicates and the built-in filter with our function:
>>> seq = ["OK", "bad text2", "OK", "Also OK"]
>>> print("\n".join(remove_duplicates(filter(is_line_ok, seq))))
OK
Also OK
This method works on any iterable that returns strings, including files:
with open(input_file, 'r') as f_in:
    with open(output_file, 'w') as f_out:
        f_out.writelines(remove_duplicates(filter(is_line_ok, f_in)))
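Because the pipeline only needs an iterable of strings, it is also easy to test without touching the filesystem. A sketch using io.StringIO as a stand-in for real files (both functions repeated so the snippet is self-contained):

```python
import io

def remove_duplicates(seq):
    found = set()
    for item in seq:
        if item in found:
            continue
        found.add(item)
        yield item

def is_line_ok(line):
    if "bad text1" in line:
        return False
    if "bad text2" in line:
        return False
    return True

# StringIO objects iterate line by line, just like real file objects
f_in = io.StringIO("OK\nbad text2\nOK\nAlso OK\n")
f_out = io.StringIO()
f_out.writelines(remove_duplicates(filter(is_line_ok, f_in)))
print(f_out.getvalue())  # "OK\nAlso OK\n"
```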