Does anyone have sample code showing how to remove duplicate lines from a text file?
For small files:
string[] lines = File.ReadAllLines("filename.txt");
File.WriteAllLines("filename.txt", lines.Distinct().ToArray());
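If case differences should count as duplicates, Distinct also accepts a comparer; a minimal variant, assuming ordinal case-insensitive comparison is what you want:

// Keeps the first occurrence of each line, comparing case-insensitively.
// Requires using System; using System.IO; and using System.Linq;
string[] lines = File.ReadAllLines("filename.txt");
File.WriteAllLines("filename.txt",
    lines.Distinct(StringComparer.OrdinalIgnoreCase).ToArray());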
This should do it (and will cope with large files).
Note that it only removes consecutive duplicate lines, i.e.
a
b
b
c
b
d
will end up as
a
b
c
b
d
If you want no duplicates anywhere, you'll need to keep a set of lines you've already seen.
using System;
using System.IO;

class DeDuper
{
    static void Main(string[] args)
    {
        if (args.Length != 2)
        {
            Console.WriteLine("Usage: DeDuper <input file> <output file>");
            return;
        }

        using (TextReader reader = File.OpenText(args[0]))
        using (TextWriter writer = File.CreateText(args[1]))
        {
            string currentLine;
            string lastLine = null;

            while ((currentLine = reader.ReadLine()) != null)
            {
                if (currentLine != lastLine)
                {
                    writer.WriteLine(currentLine);
                    lastLine = currentLine;
                }
            }
        }
    }
}
Note that this assumes Encoding.UTF8, and that you want to use files. It's easy to generalize as a method, though:
static void CopyLinesRemovingConsecutiveDupes(TextReader reader, TextWriter writer)
{
    string currentLine;
    string lastLine = null;

    while ((currentLine = reader.ReadLine()) != null)
    {
        if (currentLine != lastLine)
        {
            writer.WriteLine(currentLine);
            lastLine = currentLine;
        }
    }
}
(Note that that doesn't close anything - the caller should do that.)
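For example, a caller might wire it up like this (the file names are just placeholders):

using (TextReader reader = File.OpenText("in.txt"))
using (TextWriter writer = File.CreateText("out.txt"))
{
    CopyLinesRemovingConsecutiveDupes(reader, writer);
}

The using statements dispose of the reader and writer, which the method itself deliberately doesn't do.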
Here's a version that will remove all duplicates, rather than just consecutive ones:
// Requires using System.Collections.Generic; for HashSet<T>.
static void CopyLinesRemovingAllDupes(TextReader reader, TextWriter writer)
{
    string currentLine;
    HashSet<string> previousLines = new HashSet<string>();

    while ((currentLine = reader.ReadLine()) != null)
    {
        // Add returns true if it was actually added,
        // false if it was already there
        if (previousLines.Add(currentLine))
        {
            writer.WriteLine(currentLine);
        }
    }
}
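Because it takes a TextReader and TextWriter rather than file paths, the same method works on in-memory text as well; a quick illustration:

// Feed the method a StringReader/StringWriter instead of file streams.
var input = new StringReader("a\nb\na\nc");
var output = new StringWriter();
CopyLinesRemovingAllDupes(input, output);
Console.Write(output); // a, b, c - the second "a" is dropped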
For a long file (with non-consecutive duplicates) I'd copy the file line by line, building a hash-to-position lookup table as I went.
As each line is copied, check its hash value; if there is a collision, double-check that the line really is the same before moving on.
Only worth it for fairly large files, though.
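Here's a rough sketch of one way that could look. The class and helper names are my own, and it assumes UTF-8 text and .NET 4.5 or later (for the leaveOpen StreamReader constructor); collisions are verified by seeking back into the output file and re-reading the candidate line:

using System.Collections.Generic;
using System.IO;
using System.Text;

class PositionDeDuper
{
    static void Main(string[] args)
    {
        // Hash code of each written line -> byte offsets of those
        // lines in the output file.
        var positions = new Dictionary<int, List<long>>();

        using (var reader = new StreamReader(args[0]))
        using (var output = new FileStream(args[1], FileMode.Create,
                                           FileAccess.ReadWrite))
        using (var writer = new StreamWriter(output))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                int hash = line.GetHashCode();
                List<long> offsets;
                if (positions.TryGetValue(hash, out offsets))
                {
                    // Possible duplicate: re-read each candidate and
                    // compare properly, so collisions are harmless.
                    if (IsDuplicate(output, writer, offsets, line))
                        continue;
                }
                else
                {
                    offsets = new List<long>();
                    positions[hash] = offsets;
                }

                writer.Flush(); // make output.Position accurate
                offsets.Add(output.Position);
                writer.WriteLine(line);
            }
        }
    }

    // Seeks back to each recorded offset, compares the stored line with
    // the incoming one, and restores the stream position afterwards.
    static bool IsDuplicate(FileStream output, StreamWriter writer,
                            List<long> offsets, string line)
    {
        writer.Flush();
        long end = output.Length;
        try
        {
            foreach (long offset in offsets)
            {
                output.Position = offset;
                using (var peek = new StreamReader(output, Encoding.UTF8,
                                                   false, 1024, leaveOpen: true))
                {
                    if (peek.ReadLine() == line)
                        return true;
                }
            }
            return false;
        }
        finally
        {
            output.Position = end;
        }
    }
}

The lookup table holds roughly an int and a long per unique line rather than the lines themselves, which is what keeps the memory footprint small for files that don't fit in RAM.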
Here's a streaming approach that should incur less overhead than reading all unique strings into memory.
// Requires using System.IO; and using System.Collections.Generic;
// (File.CreateText truncates any existing output file, which
// File.OpenWrite would not.)
using (var sr = new StreamReader(File.OpenRead(@"C:\Temp\in.txt")))
using (var sw = File.CreateText(@"C:\Temp\out.txt"))
{
    var lineHashes = new HashSet<int>();

    while (!sr.EndOfStream)
    {
        string line = sr.ReadLine();

        // Add returns false if this hash has been seen before.
        // Caveat: two different lines can share a hash code, in which
        // case the later line is wrongly dropped - store the full
        // strings (as above) if that risk isn't acceptable.
        if (lineHashes.Add(line.GetHashCode()))
        {
            sw.WriteLine(line);
        }
    }
}