ansaurus

Question

Answer 1

+3 A:

Have you considered loading the data directly into SQL Server and then manipulating it in the database? The database engine is already designed to perform manipulation of large volumes of data in an efficient manner. This may yield better results overall and allows you to leverage the capabilities of the database and SQL language to do the heavy lifting. It's the old "work smarter not harder" principle.

There are a number of different methods to load data into SQL Server, so you may want to examine these to see if any are a good fit. If you are using SQLServer 2005 or later and you really need to do some manipulation on the data in C#, you can always use a managed stored procedure.

Something to realize here is that the OutOfMemoryException is a bit misleading. Memory is more than just the amount of physical RAM you have. What you are likely running out of is addressable memory. This is a very different thing.

When you load a large file into memory and transform it into a DataTable it likely requires a lot more than just 800Mb to represent the same data. Since 32bit .NET processes are limited to just under 2Gb of addressable memory, you will likely never be able to process this quantity of data in a single batch.

What you will likely need to do is to process the data in a streaming manner. In other words, don't try to load it all into a DataTable and then bulk insert to SQLServer. Rather process the file in chunks, clearing out the prior set of rows once you're done with them.

Now, if you have access to a 64-bit machine with lots of memory (to avoid VM thrashing) and a copy of the 64-bit .NET runtime, you could probably get away within running the code unchanged. But I would suggest making the necessary changes anyways since it will likely improve the performance of this even in that environment.

LBushkin 2010-09-28 20:32:14

"What you will likely need to do is to process the data in a streaming manner"Great - this is what I think I would like to do, but I was having trouble finding an example of how to do this. This sounds similar to the other suggestion below and is a similar answer to other questions like mine I've found on SO, but none of the answers I've found so far has an example..

tt2 2010-09-28 21:32:53

800Mb text is 1.6Gb(+ overhead) as UTF-16 which is what C# strings uses, so there goes all the address space.

nos 2010-09-28 21:33:37

@tt2: See [Thomas Levesque's answer](http://stackoverflow.com/questions/3816789/read-from-streamreader-in-batches-c/3817179#3817179) for an example of a streaming solution.

LBushkin 2010-09-28 21:37:59

The link on addressable memory was a good read, thank you for sharing.

tt2 2010-09-28 21:51:04

Answer 2

+1 A:

SqlBulkCopy.WriteToServer has an overload that accepts an IDataReader. You can implement your own IDataReader as a wrapper around the StreamReader where the Read() method will consume a single line from the StreamReader. This way the data will be "streamed" into the database instead of trying to build it up in memory as a DataTable first. Hope that helps.

Mark 2010-09-28 21:21:03

Good idea, however implementing IDataReader is quite a lot of work, just because of the number of methods...

Thomas Levesque 2010-09-28 21:41:09

I admit it is a lot of work. I've implemented an EnumeratorDataReader<T> that basically adapts a Enumerator<T> to an IDataReader. This way I should never need to implement an IDataReader again, I simply implement an iterator block that parses a single line (similarly to your example Thomas) and then create an EnumeratorDataReader from that. I see no practical way to share this code through StackOverflow, or I would.

Mark 2010-09-28 21:55:39

Answer 3

+2 A:

Do you actually need to process the data by batches of rows ? Or could you process it row by row ? In the latter case, I think Linq could be very helpful here, because it makes it easy to stream data across a "pipeline" of methods. That way you don't need to load a lot of data at once, only one row at a time

First, you need to make your StreamReader enumerable. This is easily done with an extension method:

public static class TextReaderExtensions
{
    public static IEnumerable<string> Lines(this TextReader reader)
    {
        string line;
        while((line = reader.ReadLine()) != null)
        {
            yield return line;
        }
    }
}

That way you can use the StreamReader as the source for a Linq query.

Then you need a method that takes a string and converts it to a DataRow:

DataRow ParseDataRow(string input)
{
    // Your parsing logic here
    ...
}

With those elements, you can easily project each line from the file to a DataRow, and do whatever you need with it:

using (var reader = new StreamReader(fileName))
{
    var rows = reader.Lines().Select(ParseDataRow);
    foreach(DataRow row in rows)
    {
        // Do something with the DataRow
    }
}

(note that you could do something similar with a simple loop, without using Linq, but I think Linq makes the code more readable...)

Thomas Levesque 2010-09-28 21:28:02

+1. Nice example. I would probably note that you want to make sure that you **do not add the row** to the DataTable which creates it, otherwise you lose the memory benefits of a streaming solution.

LBushkin 2010-09-28 21:33:53

Great - thank you for this example, I will try it and let you know how it goes

tt2 2010-09-28 21:42:02

ansaurus

tags:

views:

answers:

Read from StreamReader in batches (C#)

related questions