tags:

views: 85

answers: 4

The text file is about 22,000 lines (roughly 3.5 MB), and it contains many duplicate lines. I simply want to remove the duplicates, along with any lines that contain certain unwanted strings.

My approach was to read the file into a big list with the readlines() method, then read the file again as one big string with the read() method. I iterate over the list, count each line's occurrences, and replace duplicates in the string with "" (the empty string). It took me 10 minutes to finish the job?!

Is there any fast way to do this?

Thanks a lot!

A: 
goodLines = set()
badString = 'bad string'

with open(inFilename, 'r') as f:
    for line in f:
        # keep each distinct line that doesn't contain the unwanted string;
        # the set silently drops duplicates
        if badString not in line:
            goodLines.add(line)

# and let's output these lines (sorted, unique) in another file...

with open(outFilename, 'w') as f:
    f.writelines(sorted(goodLines))
eumiro
+3  A: 
list(set(line for line in file.readlines()
         if 'badstring' not in line
         and 'garbage' not in line))

Also, a regex might be faster than multiple not in tests.

Marcelo Cantos
you don't need `readlines` there
SilentGhost
Well, I'm still a little bit confused. Would you please explain how it works? Thanks!
Shane
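The regex alternative Marcelo mentions can be sketched like this. The pattern and input lines are illustrative; one compiled alternation replaces several `not in` substring tests, and a set tracks lines already seen so the original order is preserved:

```python
import re

# Hypothetical pattern: one alternation instead of several substring tests.
bad = re.compile('badstring|garbage')

lines = ["ok 1\n", "contains badstring\n", "ok 1\n", "garbage here\n", "ok 2\n"]

seen = set()
good = []
for line in lines:
    # skip lines matching the pattern, and lines we've already kept
    if bad.search(line) or line in seen:
        continue
    seen.add(line)
    good.append(line)

# good == ["ok 1\n", "ok 2\n"]
```

Whether the regex actually beats chained `not in` tests depends on the number of patterns and the input, so it is worth timing both on the real file.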
A: 

Is it a one off? If yes, just paste it into Excel and remove the duplicates there. :)

Vahe
A: 

I almost always do file processing using generators. This makes for code that's fast, easy to modify, and easy to test.

First, build a generator that removes duplicates:

def remove_duplicates(seq):
    found = set()
    for item in seq:
        if item in found:
            continue
        found.add(item)
        yield item

Does it work?

>>> print "\n".join(remove_duplicates(["aa", "bb", "cc", "aa"]))
aa
bb
cc

Apparently so. Next, create a function that tells you whether or not a line is OK:

def is_line_ok(line):
    if "bad text1" in line:
        return False
    if "bad text2" in line:
        return False
    return True

Does this work?

>>> is_line_ok("this line contains bad text2.")
False
>>> is_line_ok("this line's ok.")
True
>>> 

So now we can use remove_duplicates and itertools.ifilter with our function:

>>> seq = ["OK", "bad text2", "OK", "Also OK"]
>>> print "\n".join(remove_duplicates(ifilter(is_line_ok, seq)))
OK
Also OK

This method works on any iterable that returns strings, including files:

from itertools import ifilter

with open(input_file, 'r') as f_in:
    with open(output_file, 'w') as f_out:
        f_out.writelines(remove_duplicates(ifilter(is_line_ok, f_in)))
Robert Rossney
Thanks a lot for the details!
Shane
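A note for readers on Python 3: itertools.ifilter was removed, and the built-in filter is already lazy, so the same generator pipeline works unchanged with filter. A minimal sketch (the two helper functions from the answer above are repeated so the snippet is self-contained):

```python
def remove_duplicates(seq):
    # yield each item the first time it appears, in original order
    found = set()
    for item in seq:
        if item in found:
            continue
        found.add(item)
        yield item

def is_line_ok(line):
    return "bad text1" not in line and "bad text2" not in line

seq = ["OK", "bad text2", "OK", "Also OK"]
result = list(remove_duplicates(filter(is_line_ok, seq)))
# result == ["OK", "Also OK"]
```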