tags:

views: 85

answers: 4

The text file is about 22,000 lines (roughly 3.5 MB), and it contains many duplicate lines. I simply want to remove the duplicates, along with any lines that contain certain unwanted strings.

My approach was to read the file into a big list with the readlines() method, then read the file again as one big string with the read() method. I iterate over the list, count each line's occurrences, and replace duplicates in the string with "" (the empty string). It took me 10 minutes to finish the job?!

Is there any fast way to do this?

Thanks a lot!

A: 
goodLines = set()
badString = 'bad string'

with open(inFilename, 'r') as f:
    for line in f:
        # keep each distinct line that doesn't contain the unwanted string;
        # the set silently drops duplicates
        if badString not in line:
            goodLines.add(line)

# and let's output these lines (sorted, unique) in another file...

with open(outFilename, 'w') as f:
    f.writelines(sorted(goodLines))
eumiro
+3  A: 
list(set(line for line in file.readlines()
         if 'badstring' not in line
         and 'garbage' not in line))

Also, a regex might be faster than multiple not in tests.

Marcelo Cantos
you don't need `readlines` there
SilentGhost
Well, I'm still a little bit confused. Would you please explain how it works? Thanks!
Shane
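The regex alternative Marcelo mentions can be sketched like this. The pattern and input lines are illustrative; one compiled alternation replaces several `not in` substring tests, and a set tracks lines already seen so the original order is preserved:

```python
import re

# Hypothetical pattern: one alternation instead of several substring tests.
bad = re.compile('badstring|garbage')

lines = ["ok 1\n", "contains badstring\n", "ok 1\n", "garbage here\n", "ok 2\n"]

seen = set()
good = []
for line in lines:
    # skip lines matching the pattern, and lines we've already kept
    if bad.search(line) or line in seen:
        continue
    seen.add(line)
    good.append(line)

# good == ["ok 1\n", "ok 2\n"]
```

Whether the regex actually beats chained `not in` tests depends on the number of patterns and the input, so it is worth timing both on the real file.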
A: 

Is it a one off? If yes, just paste it into Excel and remove the duplicates there. :)

Vahe
A: 

I almost always do file processing using generators. This makes for code that's fast, easy to modify, and easy to test.

First, build a generator that removes duplicates:

def remove_duplicates(seq):
    found = set()
    for item in seq:
        if item in found:
            continue
        found.add(item)
        yield item

Does it work?

>>> print "\n".join(remove_duplicates(["aa", "bb", "cc", "aa"]))
aa
bb
cc

Apparently so. Next, create a function that tells you whether or not a line is OK:

def is_line_ok(line):
    if "bad text1" in line:
        return False
    if "bad text2" in line:
        return False
    return True

Does this work?

>>> is_line_ok("this line contains bad text2.")
False
>>> is_line_ok("this line's ok.")
True
>>> 

So now we can use remove_duplicates and itertools.ifilter with our function:

>>> seq = ["OK", "bad text2", "OK", "Also OK"]
>>> print "\n".join(remove_duplicates(ifilter(is_line_ok, seq)))
OK
Also OK

This method works on any iterable that returns strings, including files:

from itertools import ifilter

with open(input_file, 'r') as f_in:
    with open(output_file, 'w') as f_out:
        f_out.writelines(remove_duplicates(ifilter(is_line_ok, f_in)))
Robert Rossney
Thanks a lot for the details!
Shane
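A note for readers on Python 3: itertools.ifilter was removed, and the built-in filter is already lazy, so the same generator pipeline works unchanged with filter. A minimal sketch (the two helper functions from the answer above are repeated so the snippet is self-contained):

```python
def remove_duplicates(seq):
    # yield each item the first time it appears, in original order
    found = set()
    for item in seq:
        if item in found:
            continue
        found.add(item)
        yield item

def is_line_ok(line):
    return "bad text1" not in line and "bad text2" not in line

seq = ["OK", "bad text2", "OK", "Also OK"]
result = list(remove_duplicates(filter(is_line_ok, seq)))
# result == ["OK", "Also OK"]
```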