views: 756
answers: 3

Scenario: I have over 1.5GB of text and CSV files that I need to process mathematically. I tried using SQL Server Express, but loading the information, even with bulk import, takes a very long time, and ideally I need to have the entire data set in memory to reduce hard disk IO.

There are over 120,000,000 records, but even when I attempt to filter the information down to just one column (in memory), my C# console application consumes ~3.5GB of memory to process just 125MB (700MB actually read in) of text.

It seems that the strings and string arrays are not being collected by the GC, even after setting all references to null and wrapping every IDisposable in a using block.

I think the culprit is the String.Split() method, which creates a new string for each comma-separated value.

You may suggest that I shouldn't even read the unneeded* columns into a string array, but that misses the point: How can I place this entire data set in memory, so I can process it in parallel in C#?

I could optimize the statistical algorithms and coordinate tasks with a sophisticated scheduling algorithm, but this is something I was hoping to do before I ran into memory problems, not because of them.

I have included a full console application that simulates my environment and should help replicate the problem.

Any help is appreciated. Thanks in advance.

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;

namespace InMemProcessingLeak
{
    class Program
    {
        static void Main(string[] args)
        {
            //Set up the test environment. Uncomment once.
            //15000-20000 files would be more realistic
            //InMemoryProcessingLeak.GenerateTestDirectoryFilesAndColumns(3000, 3);
            //GC
            GC.Collect();
            //Demonstrate Large Object Memory Allocation Problem (LOMAP)
            InMemoryProcessingLeak.SelectColumnFromAllFiles(3000, 2);
        }
    }

    class InMemoryProcessingLeak
    {
        public static List<string> SelectColumnFromAllFiles(int filesToSelect, int column)
        {
            List<string> allItems = new List<string>();
            int fileCount = filesToSelect;
            long fileSize, totalReadSize = 0;

            for (int i = 1; i <= fileCount; i++)
            {
                allItems.AddRange(SelectColumn(i, column, out fileSize));
                totalReadSize += fileSize;
                Console.Clear();
                Console.Out.WriteLine("Reading file {0:00000} of {1}", i, fileCount);
                Console.Out.WriteLine("Memory = {0}MB", GC.GetTotalMemory(false) / 1048576);
                Console.Out.WriteLine("Total Read = {0}MB", totalReadSize / 1048576);
            }
            Console.ReadLine();
            return allItems;

        }

        //reads a csv file and returns the values for a selected column
        private static List<string> SelectColumn(int fileNumber, int column, out long fileSize)
        {
            string fileIn;
            FileInfo file = new FileInfo(string.Format(@"MemLeakTestFiles/File{0:00000}.txt", fileNumber));
            fileSize = file.Length;
            using (System.IO.FileStream fs = file.Open(FileMode.Open, FileAccess.Read, FileShare.Read))
            {
                using (System.IO.StreamReader sr = new System.IO.StreamReader(fs))
                {
                    fileIn = sr.ReadToEnd();
                }
            }

            string[] lineDelimiter = { "\n" };
            string[] allLines = fileIn.Split(lineDelimiter, StringSplitOptions.None);

            List<string> processedColumn = new List<string>();

            string current;
            for (int i = 0; i < allLines.Length - 1; i++)
            {
                current = GetColumnFromProcessedRow(allLines[i], column);
                processedColumn.Add(current);
            }

            for (int i = 0; i < lineDelimiter.Length; i++) //GC
            {
                lineDelimiter[i] = null;
            }
            lineDelimiter = null;

            for (int i = 0; i < allLines.Length; i++) //GC
            {
                allLines[i] = null;
            }
            allLines = null;
            current = null;

            return processedColumn;
        }

        //returns a row value from the selected comma separated string and column position
        private static string GetColumnFromProcessedRow(string line, int columnPosition)
        {
            string[] entireRow = line.Split(",".ToCharArray());
            string currentColumn = entireRow[columnPosition];
            //GC
            for (int i = 0; i < entireRow.Length; i++)
            {
                entireRow[i] = null;
            }
            entireRow = null;
            return currentColumn;
        }

        #region Generators
        public static void GenerateTestDirectoryFilesAndColumns(int filesToGenerate, int columnsToGenerate)
        {
            DirectoryInfo dirInfo = new DirectoryInfo("MemLeakTestFiles");
            if (!dirInfo.Exists)
            {
                dirInfo.Create();
            }
            Random seed = new Random();

            string[] columns = new string[columnsToGenerate];

            StringBuilder sb = new StringBuilder();
            for (int i = 1; i <= filesToGenerate; i++)
            {
                int rows = seed.Next(10, 8000);
                for (int j = 0; j < rows; j++)
                {
                    sb.Append(GenerateRow(seed, columnsToGenerate));
                }
                using (TextWriter tw = new StreamWriter(String.Format(@"{0}/File{1:00000}.txt", dirInfo, i)))
                {
                    tw.Write(sb.ToString());
                    tw.Flush();
                }
                sb.Remove(0, sb.Length);
                Console.Clear();
                Console.Out.WriteLine("Generating file {0:00000} of {1}", i, filesToGenerate);
            }
        }

        private static string GenerateString(Random seed)
        {
            StringBuilder sb = new StringBuilder();
            int characters = seed.Next(4, 12);
            for (int i = 0; i < characters; i++)
            {
                sb.Append(Convert.ToChar(Convert.ToInt32(Math.Floor(26 * seed.NextDouble() + 65))));
            }
            return sb.ToString();
        }

        private static string GenerateRow(Random seed, int columnsToGenerate)
        {
            StringBuilder sb = new StringBuilder();

            sb.Append(seed.Next());
            for (int i = 0; i < columnsToGenerate - 1; i++)
            {
                sb.Append(",");
                sb.Append(GenerateString(seed));
            }
            sb.Append("\n");

            return sb.ToString();
        }
        #endregion
    }
}

*These other columns will be needed and accessed both sequentially and randomly throughout the life of the program, so reading them from disk each time would be a tremendous overhead.

**Environment Notes: 4GB of DDR2-800 SDRAM, Core 2 Duo 2.5GHz, .NET Runtime 3.5 SP1, Vista 64-bit.

+13  A: 

Yes, String.Split creates a new String object for each "piece" - that's what it's meant to do.

Now, bear in mind that strings in .NET are Unicode (UTF-16 really), and with the object overhead the cost of a string in bytes is approximately 20 + 2*n where n is the number of characters.

That means if you've got a lot of small strings, it'll take a lot of memory compared with the size of text data involved. For example, an 80 character line split into 10 x 8 character strings will take 80 bytes in the file, but 10 * (20 + 2*8) = 360 bytes in memory - a 4.5x blow-up!
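
If you want to sanity-check that figure on your own machine, a rough measurement along these lines should come out near 20 + 2*n bytes per string (an illustrative throwaway program, not part of any library; exact numbers vary by runtime and platform):

using System;

class StringOverheadDemo
{
    static void Main()
    {
        //Allocate the array first so only the strings themselves are measured.
        string[] pieces = new string[1000000];
        long before = GC.GetTotalMemory(true);
        for (int i = 0; i < pieces.Length; i++)
        {
            pieces[i] = new string('x', 8);
        }
        long after = GC.GetTotalMemory(true);
        //pieces is still referenced below, so the strings cannot be collected before measurement.
        Console.WriteLine("Approximate bytes per 8-character string: {0}",
                          (after - before) / pieces.Length);
    }
}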

I doubt that this is a GC problem - and I'd advise you to remove the extra statements setting variables to null, as they're not necessary - it's just a problem of having too much data.

What I would suggest is that you read the file line-by-line (using TextReader.ReadLine() instead of TextReader.ReadToEnd()). Clearly having the whole file in memory if you don't need to is wasteful.
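
A minimal sketch of that approach, written as a drop-in replacement for the SelectColumn method in the question (SelectColumnLineByLine is a made-up name, and it assumes the same usings and file layout as the posted program):

private static List<string> SelectColumnLineByLine(int fileNumber, int column, out long fileSize)
{
    FileInfo file = new FileInfo(string.Format(@"MemLeakTestFiles/File{0:00000}.txt", fileNumber));
    fileSize = file.Length;
    List<string> processedColumn = new List<string>();
    using (StreamReader sr = file.OpenText())
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            if (line.Length == 0) continue; //skip the trailing empty line, if any
            //Only the selected field is kept; the rest of the line becomes garbage immediately.
            processedColumn.Add(line.Split(',')[column]);
        }
    }
    return processedColumn;
}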

Jon Skeet
Extremely informative answer. As MSalters suggested, it seems that I would need to represent the data a different way if I want to work with all the information at once.
exceptionerror
Yes - although you'll still run into problems eventually. If you can work out a way of processing the data in a streaming fashion, the solution will scale a lot better.
Jon Skeet
Would you recommend something like "push" linq so that I can extract relational information across files without looping?
exceptionerror
It depends on exactly what you need to do, but yes, Push LINQ is great for aggregation over huge data sets.
Jon Skeet
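
(For reference, this is not Push LINQ itself, but plain LINQ to Objects over an iterator already gives that streaming flavour; every name below is illustrative and the file layout is the one from the question. Nothing here holds more than one line in memory at a time, apart from whatever the aggregation itself needs.)

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class StreamingAggregateSketch
{
    //Yields the selected column from every file, one value at a time.
    static IEnumerable<string> ColumnValues(int fileCount, int column)
    {
        for (int i = 1; i <= fileCount; i++)
        {
            using (StreamReader sr = File.OpenText(string.Format(@"MemLeakTestFiles/File{0:00000}.txt", i)))
            {
                string line;
                while ((line = sr.ReadLine()) != null)
                {
                    if (line.Length > 0)
                    {
                        yield return line.Split(',')[column];
                    }
                }
            }
        }
    }

    static void Main()
    {
        //Example aggregation: average length of the values in column 2 across 3000 files.
        double averageLength = ColumnValues(3000, 2).Average(value => value.Length);
        Console.WriteLine("Average length: {0}", averageLength);
    }
}
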
+3  A: 

I would suggest reading line by line instead of the entire file, or in blocks of up to 1-2MB.

I wrote a library to read web server log files a few years ago and found that reading line by line wasn't the fastest way of getting the data in. Here's one method - I was parsing line by line.

Update:
From Jon's comments I was curious and experimented with 4 methods:

  • StreamReader.ReadLine (default and custom buffer size)
  • StreamReader.ReadToEnd
  • My method listed above

Reading a 180MB log file:

  • ReadLine ms: 1937
  • ReadLine bigger buffer, ascii ms: 1926
  • ReadToEnd ms: 2151
  • Custom ms: 1415

The custom StreamReader was:

StreamReader streamReader = new StreamReader(fileStream, Encoding.Default, false, 16384)

StreamReader's buffer by default is 1024.
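
A sketch of wiring that constructor up for line-by-line reading (BufferedLineReader and the path parameter are made-up names; the 16384 buffer size is the figure quoted above):

using System.Collections.Generic;
using System.IO;
using System.Text;

class BufferedLineReader
{
    //Streams lines back one at a time using the larger 16KB read buffer.
    public static IEnumerable<string> ReadLines(string path)
    {
        using (FileStream fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (StreamReader streamReader = new StreamReader(fileStream, Encoding.Default, false, 16384))
        {
            string line;
            while ((line = streamReader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }
}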

For memory consumption (the actual question!) - ~800MB used. And the method I give still uses a StringBuilder (which itself uses a string), so it consumes no less memory.

Chris S
Or just call TextReader.ReadLine()... that's what it's there for...
Jon Skeet
(I'd also strongly suggest using a "using" statement to avoid leaving the stream open in case of an exception, and renaming "bytesRead" to "charactersRead".)
Jon Skeet
I'll edit my answer as I contradict myself + update that 3 year old code with your suggestions. The 16384 buffer size was the main difference which was from a discussion on microsoft.public.dotnet.languages.csharp about C++ vs C# performance for text size.
Chris S
TextReader and StreamReader do it byte by byte IIRC, which was a lot slower when I did some tests reading 1.5MB log files - I was parsing a line at a time too
Chris S
I hope in your tests you rebooted between runs to clear out the file cache...
Jon Skeet
TextReader.ReadLine() checks char by char for end-of-line characters - but it doesn't call Read for a single character at a time. (And both StreamReader and FileStream have buffers.) Btw, using File.OpenText applies a few optimizations to the FileStream that it creates, in particular (cont)
Jon Skeet
it optimizes for sequential access. (I've been benchmarking a mixture of IO and CPU stuff recently - see http://msmvps.com/jon.skeet and the last few articles)
Jon Skeet
This was the code Jon: http://pastebin.com/m6e8a98bd . It's rough and ready but I did get roughly the same results each time. I don't know why it's faster but my guess would be no branching on \n and \r in the loop and decoding using Decoder
Chris S
4 separate files instead of a reboot. The ReadLine method is faster for the first 2 tries, then the time for the char[] method decreases a lot
Chris S
+2  A: 

Modern GC languages take advantage of large amounts of cheap RAM to offload memory-management tasks. This imposes a certain overhead, but your typical business app doesn't really need that much information anyway. Many programs get by with fewer than a thousand objects. Manually managing that many is a chore, but even a thousand bytes of per-object overhead wouldn't matter.

In your case, the per-object overhead is becoming a problem. You could, for instance, represent each column as one object, implemented as a single String plus an array of integer offsets. To return a single field, you return a substring (possibly via a shim).
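
A rough sketch of that single-string-plus-offsets idea (PackedColumn and its members are made-up names; this illustrates the layout, it is not a finished class):

using System.Collections.Generic;
using System.Text;

class PackedColumn
{
    private readonly string data;   //all field values concatenated into one string
    private readonly int[] offsets; //start index of each field; the extra last entry marks the end

    public PackedColumn(IList<string> values)
    {
        offsets = new int[values.Count + 1];
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.Count; i++)
        {
            offsets[i] = sb.Length;
            sb.Append(values[i]);
        }
        offsets[values.Count] = sb.Length;
        data = sb.ToString();
    }

    public int Count { get { return offsets.Length - 1; } }

    //Materializes a field on demand as a substring; callers shouldn't hold on to
    //the result if they want to keep memory flat.
    public string this[int index]
    {
        get { return data.Substring(offsets[index], offsets[index + 1] - offsets[index]); }
    }
}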

MSalters
It seems that I've exhausted available C# best practices, and your answer pointed me to the next best thing. I really like C#, but I'm wondering if it would be a good idea to learn and work with C++/CLI in the future, if I encounter other data-intensive challenges like this.
exceptionerror
Do consider native C++; it can be quite efficient in these cases. Yes, you'll have to write a lot of code for functionality that's included in C#. But that's exactly the point; you are one of the few who can't afford the .Net defaults.
MSalters
I had a very similar problem a few years back when I experimented with a throwaway .net DB conversion utility. I couldn't make the .net one work fast, but a very simple c++ OLEDB app worked really quickly. I figured the .net lib I was using was very inefficient wrt memory.
gbjbaanb
the difference was 10 hours down to 10 minutes (roughly). I appreciate I may have cocked it up, but I tried everything I could think of to make it work faster. Sometimes you just need a better tool for some jobs.
gbjbaanb