ansaurus

Question

How might I remove duplicate lines from a file?

Answer 1

A:

Sort and delete

Broken Link 2009-07-31 22:40:22

Whoever voted this down should also vote down the question, since it is so vague.

Broken Link 2009-07-31 22:50:54

-1 for giving vague help. Maybe the question is bad, but your answer is poor :(

n00ki3 2009-07-31 23:43:11

Answer 2

+6 A:

If you're on *nix, try running the following command:

sort <file name> | uniq

David Locke 2009-07-31 22:43:14

Or just sort -u

William Pursell 2009-07-31 22:44:57

Answer 3

+6 A:

On Unix/Linux, use the uniq command, as per David Locke's answer, or sort, as per William Pursell's comment.

If you need a Python script:

lines_seen = set()  # holds lines already seen
# infilename and outfilename are paths you supply
with open(infilename, "r") as infile, open(outfilename, "w") as outfile:
    for line in infile:
        if line not in lines_seen:  # not a duplicate
            outfile.write(line)
            lines_seen.add(line)

Update: The sort/uniq combination will remove duplicates but return a file with the lines sorted, which may or may not be what you want. The Python script above won't reorder lines; it just drops duplicates. Of course, to get the script above to sort as well, just leave out the outfile.write(line) and instead, immediately after the loop and while the output file is still open, do outfile.writelines(sorted(lines_seen)).

Vinay Sajip 2009-07-31 22:46:20
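
As an illustration of the sorted variant described in the update above, here is a minimal sketch. The function name write_sorted_unique and its two path parameters are hypothetical, chosen only for this example. It assumes every line, including the last, ends with a newline; otherwise a final "b" and an earlier "b\n" would count as different lines.

```python
def write_sorted_unique(infilename, outfilename):
    """Write the distinct lines of infilename to outfilename, sorted."""
    with open(infilename, "r") as infile:
        # Iterating over a file yields its lines, newline included.
        lines_seen = set(infile)
    with open(outfilename, "w") as outfile:
        # sorted() returns a list, so the arbitrary set order never reaches the file.
        outfile.writelines(sorted(lines_seen))
```

Like the script above, this holds every distinct line in memory at once.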

You need to run sort before you run uniq because uniq will only remove lines if they're identical to the previous line.

David Locke 2009-07-31 22:54:14
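
The point about adjacent lines can be sketched in Python. This is only a rough model of what uniq does when run without options, not uniq itself; the helper name drop_adjacent_duplicates and the sample data are made up for illustration.

```python
from itertools import groupby

def drop_adjacent_duplicates(lines):
    """Keep one line from each run of identical consecutive lines."""
    # groupby starts a new group whenever the value changes, so only
    # neighbouring duplicates are merged.
    return [key for key, _run in groupby(lines)]

lines = ["b\n", "a\n", "b\n", "b\n", "a\n"]

# Without sorting, only the two neighbouring "b" lines collapse.
unsorted_result = drop_adjacent_duplicates(lines)

# After sorting, equal lines sit next to each other, so all duplicates go.
sorted_result = drop_adjacent_duplicates(sorted(lines))
```

Here unsorted_result still contains "a" and "b" twice each, while sorted_result contains each line once.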

Yes - I referred to your answer but didn't reiterate that it was sort followed by uniq.

Vinay Sajip 2009-07-31 23:08:32

+1 for this solution. One further enhancement might be to store the md5 sum of the line, and compare the current line's md5 sum. This should significantly cut down on the memory requirements. (see http://docs.python.org/library/md5.html)

joeslice 2009-07-31 23:18:42
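
A possible sketch of the digest idea from the comment above, using hashlib (the standalone md5 module was removed in Python 3). The function name dedupe_by_digest and its parameters are hypothetical. Each MD5 digest is 16 bytes, so this only saves memory when lines are typically longer than that, and two different lines with the same digest would be treated as duplicates, which is very unlikely by accident but not impossible.

```python
import hashlib

def dedupe_by_digest(infilename, outfilename):
    """Copy infilename to outfilename, skipping lines whose MD5 was already seen."""
    digests_seen = set()  # holds 16-byte digests instead of whole lines
    # Binary mode: hashlib works on bytes, and no decoding is needed.
    with open(infilename, "rb") as infile, open(outfilename, "wb") as outfile:
        for line in infile:
            digest = hashlib.md5(line).digest()
            if digest not in digests_seen:  # first occurrence of this line
                outfile.write(line)
                digests_seen.add(digest)
```

As with the original script, the first occurrence of each line is kept and the original order is preserved.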

Answer 4

+4 A:

uniqlines = set(open('/tmp/foo').readlines())

This gives you the set of unique lines.

Writing that back to some file is as easy as:

bar = open('/tmp/bar', 'w')
bar.writelines(uniqlines)
bar.close()

marcell 2009-08-01 12:51:25

True, but the lines will be in some random order according to how they hash.

Vinay Sajip 2009-08-01 15:42:39

What's the problem with the lines not being sorted, as far as this question is concerned?

marcell 2009-08-02 23:06:48

Answer 5

+1 A:

Put all your lines in a list, make a set from it, and you are done. For example:

>>> x = ["line1","line2","line3","line2","line1"]
>>> list(set(x))
['line3', 'line2', 'line1']
>>>

Then write the contents back to the file.

Tumbleweed 2009-08-01 15:18:15

True, but the lines will be in some random order according to how they hash.

Vinay Sajip 2009-08-01 15:43:14
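
If the original order matters, one way around the ordering issue raised above is to deduplicate through a dict rather than a set. This is a sketch that assumes Python 3.7 or later, where dicts are guaranteed to keep insertion order; the helper name unique_in_order is made up for illustration.

```python
def unique_in_order(lines):
    """Return the distinct items of lines, keeping first-seen order."""
    # dict.fromkeys keeps the first occurrence of each key, and dicts
    # preserve insertion order (guaranteed from Python 3.7 on).
    return list(dict.fromkeys(lines))

x = ["line1", "line2", "line3", "line2", "line1"]
result = unique_in_order(x)
```

With the sample list from the answer, result is ['line1', 'line2', 'line3'].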
