I have a file with one column. How do I delete repeated lines in the file?

A: 

Sort and delete

Broken Link
Whoever voted this down should also go and vote down such an unclear question.
Broken Link
-1 for giving vague help. Maybe the question is bad, but your answer is poor :(
n00ki3
+6  A: 

If you're on *nix, try running the following command:

sort <file name> | uniq
David Locke
Or just sort -u
William Pursell
+6  A: 

On Unix/Linux, use the uniq command, as per David Locke's answer, or sort, as per William Pursell's comment.

If you need a Python script:

lines_seen = set()  # holds lines already seen
with open(infilename, "r") as infile, open(outfilename, "w") as outfile:
    for line in infile:
        if line not in lines_seen:  # not a duplicate
            outfile.write(line)
            lines_seen.add(line)

Update: The sort/uniq combination will remove duplicates but return a file with the lines sorted, which may or may not be what you want. The Python script above won't reorder lines; it just drops duplicates. Of course, to get the script above to sort as well, just leave out the outfile.write(line) and instead, immediately after the loop (while outfile is still open), do outfile.writelines(sorted(lines_seen)).
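
The sorted variant described in the update could look roughly like this. It is only a sketch: the function name dedupe_sorted and its two path parameters are made up for illustration and are not part of the script above.

```python
def dedupe_sorted(infilename, outfilename):
    """Write the distinct lines of infilename to outfilename in sorted order."""
    lines_seen = set()  # holds each distinct line once
    with open(infilename, "r") as infile:
        for line in infile:
            lines_seen.add(line)  # duplicates are absorbed by the set
    with open(outfilename, "w") as outfile:
        # sorted() returns a list of the distinct lines in ascending order
        outfile.writelines(sorted(lines_seen))
```

This should behave much like sort -u for plain ASCII input, but the ordering follows Python's string comparison rather than the shell's locale settings.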

Vinay Sajip
You need to run sort before you run uniq because uniq will only remove lines if they're identical to the previous line.
David Locke
Yes - I referred to your answer but didn't reiterate that it was sort followed by uniq.
Vinay Sajip
+1 for this solution. One further enhancement might be to store the md5 sum of the line, and compare the current line's md5 sum. This should significantly cut down on the memory requirements. (see http://docs.python.org/library/md5.html)
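
A rough sketch of that idea, assuming a current Python where MD5 is provided by the hashlib module rather than the old md5 module. The function name dedupe_by_digest and its parameters are hypothetical. The files are opened in binary mode so each line is already a bytes object that can be hashed directly.

```python
import hashlib

def dedupe_by_digest(infilename, outfilename):
    """Copy infilename to outfilename, skipping lines whose MD5 digest was already seen."""
    digests_seen = set()  # holds 16-byte digests instead of whole lines
    with open(infilename, "rb") as infile, open(outfilename, "wb") as outfile:
        for line in infile:  # binary mode, so each line is bytes
            digest = hashlib.md5(line).digest()
            if digest not in digests_seen:  # not a duplicate
                outfile.write(line)
                digests_seen.add(digest)
```

This only saves memory when the lines are noticeably longer than the 16-byte digest, and it treats two lines as equal whenever their digests match, which relies on MD5 collisions being vanishingly unlikely for ordinary text.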
joeslice
+4  A: 
uniqlines = set(open('/tmp/foo').readlines())

This will give you the set of unique lines.

Writing that back to some file would be as easy as:

bar = open('/tmp/bar', 'w')
bar.writelines(uniqlines)
bar.close()
marcell
True, but the lines will be in some random order according to how they hash.
Vinay Sajip
What's the problem with the lines not being sorted, as far as the question here is concerned?
marcell
+1  A: 

Read all your lines into a list, make a set from that list, and you are done. For example:

>>> x = ["line1","line2","line3","line2","line1"]
>>> list(set(x))
['line3', 'line2', 'line1']
>>>

Then write the content back to the file.
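
Because a set does not keep the original order, here is a small sketch of an order-preserving alternative. It assumes Python 3.7 or later, where dicts keep insertion order, and the helper name unique_in_order is made up for illustration.

```python
def unique_in_order(lines):
    """Return the items of lines with duplicates dropped, keeping first occurrences in order."""
    # dict keys are unique and, from Python 3.7 on, remember insertion order
    return list(dict.fromkeys(lines))

x = ["line1", "line2", "line3", "line2", "line1"]
print(unique_in_order(x))  # ['line1', 'line2', 'line3']
```

The resulting list can be passed to writelines() in the same way as the set.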

Tumbleweed
True, but the lines will be in some random order according to how they hash.
Vinay Sajip