I am trying to figure out how to split a file by the number of lines in each file. The files are CSV and I can't do it by bytes; I need to do it by lines. 20k seems to be a good number per file. What is the best way to read a stream at a given position? Stream.BaseStream.Position? So if I read the first 20k lines, I would start the position at 39,999? How do I know I am almost at the end of a file? Thanks all
+1
A:
I'd do it like this:
// helper extension method to break a sequence up into blocks lazily
// (must live in a static class; .Any() needs using System.Linq)
public static IEnumerable<ICollection<T>> SplitEnumerable<T>(
    this IEnumerable<T> sequence, int nbrPerBlock)
{
    List<T> group = new List<T>(nbrPerBlock);
    foreach (T value in sequence)
    {
        group.Add(value);
        if (group.Count == nbrPerBlock)
        {
            yield return group;
            group = new List<T>(nbrPerBlock);
        }
    }
    if (group.Any()) yield return group; // flush out any remaining partial block
}
// now it's trivial; if you want to make smaller files, just foreach
// over this and write out the lines in each block to a new file
public static IEnumerable<ICollection<string>> SplitFile(string filePath)
{
    return File.ReadLines(filePath).SplitEnumerable(20000);
}
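For example, the write-out loop might look like this (a minimal sketch; the output file naming is just for illustration):
int fileNumber = 0;
foreach (ICollection<string> block in SplitFile("input.csv"))
{
    // each 20,000-line block goes to its own numbered file
    File.WriteAllLines("output" + (fileNumber++) + ".csv", block.ToArray());
}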
Is that not sufficient for you? You mention moving from position to position, but I don't see why that's necessary.
mquander
2010-07-30 17:46:27
This works too!!!! Gosh. I love this place!
DDiVita
2010-07-30 18:55:44
+2
A:
int index = 0;

// lazily read the file and bucket lines into groups of 20,000:
// index++ / 20000 produces the same key for every 20,000 consecutive lines
var groups = from line in File.ReadLines("myfile.csv")
             group line by index++ / 20000 into g
             select g.AsEnumerable();

// write each group out to a file named 0, 1, 2, ...
int file = 0;
foreach (var group in groups)
    File.WriteAllLines((file++).ToString(), group.ToArray());
Hasan Khan
2010-07-30 17:46:43
You need to use `File.ReadLines` instead of `ReadAllLines` -- `ReadAllLines` reads it all into memory at once. Also, using `index` in the grouping function like that freaks me out.
mquander
2010-07-30 17:48:52
While this is indeed interesting, there are enough cases where you don't want to read an entire file into memory that I would at least add the stipulation that you need to know the files won't be too large if you're going to use this method.
Jimmy Hoffa
2010-07-30 18:03:53
Won't the grouping method collect everything regardless of whether you use ReadLines or ReadAllLines?
Lasse V. Karlsen
2010-07-30 18:17:42
I assume so, but with `ReadAllLines`, you'd have the whole thing in memory twice instead of once.
mquander
2010-07-30 18:47:11
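For reference, the distinction being discussed here, as a minimal sketch:
// ReadAllLines eagerly reads the entire file into a string[]
string[] eager = File.ReadAllLines("myfile.csv");

// ReadLines returns a lazy IEnumerable<string> that yields one line at a
// time as it is enumerated; GroupBy will still buffer the groups as it
// consumes the sequence, so the data ends up in memory once either way,
// but not twice
IEnumerable<string> lazy = File.ReadLines("myfile.csv");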
+2
A:
using (System.IO.StreamReader sr = new System.IO.StreamReader("path"))
{
    int fileNumber = 0;
    while (!sr.EndOfStream)
    {
        int count = 0;
        using (System.IO.StreamWriter sw = new System.IO.StreamWriter("other path" + ++fileNumber))
        {
            sw.AutoFlush = true;
            // count++ (not ++count) so each file gets the full
            // 20,000 lines rather than 19,999
            while (!sr.EndOfStream && count++ < 20000)
            {
                sw.WriteLine(sr.ReadLine());
            }
        }
    }
}
Jon B
2010-07-30 17:47:57
This seems the most straightforward to me, though for memory's sake I would possibly flush the write buffer with each write. If each line is 100 bytes, that makes 1,000 lines 100 KB, and 20,000 lines 2 MB; not a ton of memory, but an unnecessary footprint.
Jimmy Hoffa
2010-07-30 18:06:14
@Jimmy - I added `AutoFlush = True`, which automatically flushes after each write.
Jon B
2010-07-30 18:16:10
AutoFlush is a bad idea on a StreamWriter as it will flush after every single character (I looked at the code). If you don't specify a buffer size when creating a StreamWriter it defaults to only 128 characters, but that's still better than no buffer at all.
Tergiver
2010-07-30 19:37:03
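Following up on that, one alternative is to skip AutoFlush entirely and pass an explicit buffer size via the StreamWriter constructor overload that accepts one (a sketch; the 64 KB figure here is just an example):
using (var sw = new System.IO.StreamWriter("other path", false,
       System.Text.Encoding.UTF8, 65536))
{
    // the writer flushes automatically whenever the buffer fills,
    // and once more when it is disposed at the end of the using block
    sw.WriteLine("...");
}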