views: 756
answers: 3

Scenario: I have over 1.5GB of text and CSV files that I need to process mathematically. I tried using SQL Server Express, but loading the information, even with bulk import, takes a very long time, and ideally I need to have the entire data set in memory to reduce hard disk IO.

There are over 120,000,000 records, but even when I attempt to filter the information down to just one column (in memory), my C# console application consumes ~3.5GB of memory to process just 125MB (700MB actually read in) of text.

It seems that the strings and string arrays are not being collected by the GC, even after setting all references to null and wrapping every IDisposable in a using block.

I think the culprit is the String.Split() method, which creates a new string for each comma-separated value.

You may suggest that I shouldn't even read the unneeded* columns into a string array, but that misses the point: How can I place this entire data set in memory, so I can process it in parallel in C#?

I could optimize the statistical algorithms and coordinate tasks with a sophisticated scheduling algorithm, but this is something I was hoping to do before I ran into memory problems, not because of them.

I have included a full console application that simulates my environment and should help replicate the problem.

Any help is appreciated. Thanks in advance.

using System;
using System.Collections.Generic;
using System.Text;
using System.IO;

namespace InMemProcessingLeak
{
    class Program
    {
        static void Main(string[] args)
        {
            //Set up the test environment. Uncomment once.
            //15000-20000 files would be more realistic
            //InMemoryProcessingLeak.GenerateTestDirectoryFilesAndColumns(3000, 3);
            //GC
            GC.Collect();
            //Demonstrate Large Object Memory Allocation Problem (LOMAP)
            InMemoryProcessingLeak.SelectColumnFromAllFiles(3000, 2);
        }
    }

    class InMemoryProcessingLeak
    {
        public static List<string> SelectColumnFromAllFiles(int filesToSelect, int column)
        {
            List<string> allItems = new List<string>();
            int fileCount = filesToSelect;
            long fileSize, totalReadSize = 0;

            for (int i = 1; i <= fileCount; i++)
            {
                allItems.AddRange(SelectColumn(i, column, out fileSize));
                totalReadSize += fileSize;
                Console.Clear();
                Console.Out.WriteLine("Reading file {0:00000} of {1}", i, fileCount);
                Console.Out.WriteLine("Memory = {0}MB", GC.GetTotalMemory(false) / 1048576);
                Console.Out.WriteLine("Total Read = {0}MB", totalReadSize / 1048576);
            }
            Console.ReadLine();
            return allItems;

        }

        //reads a csv file and returns the values for a selected column
        private static List<string> SelectColumn(int fileNumber, int column, out long fileSize)
        {
            string fileIn;
            FileInfo file = new FileInfo(string.Format(@"MemLeakTestFiles/File{0:00000}.txt", fileNumber));
            fileSize = file.Length;
            using (System.IO.FileStream fs = file.Open(FileMode.Open, FileAccess.Read, FileShare.Read))
            {
                using (System.IO.StreamReader sr = new System.IO.StreamReader(fs))
                {
                    fileIn = sr.ReadToEnd();
                }
            }

            string[] lineDelimiter = { "\n" };
            string[] allLines = fileIn.Split(lineDelimiter, StringSplitOptions.None);

            List<string> processedColumn = new List<string>();

            string current;
            for (int i = 0; i < allLines.Length - 1; i++)
            {
                current = GetColumnFromProcessedRow(allLines[i], column);
                processedColumn.Add(current);
            }

            for (int i = 0; i < lineDelimiter.Length; i++) //GC
            {
                lineDelimiter[i] = null;
            }
            lineDelimiter = null;

            for (int i = 0; i < allLines.Length; i++) //GC
            {
                allLines[i] = null;
            }
            allLines = null;
            current = null;

            return processedColumn;
        }

        //returns a row value from the selected comma separated string and column position
        private static string GetColumnFromProcessedRow(string line, int columnPosition)
        {
            string[] entireRow = line.Split(",".ToCharArray());
            string currentColumn = entireRow[columnPosition];
            //GC
            for (int i = 0; i < entireRow.Length; i++)
            {
                entireRow[i] = null;
            }
            entireRow = null;
            return currentColumn;
        }

        #region Generators
        public static void GenerateTestDirectoryFilesAndColumns(int filesToGenerate, int columnsToGenerate)
        {
            DirectoryInfo dirInfo = new DirectoryInfo("MemLeakTestFiles");
            if (!dirInfo.Exists)
            {
                dirInfo.Create();
            }
            Random seed = new Random();

            string[] columns = new string[columnsToGenerate];

            StringBuilder sb = new StringBuilder();
            for (int i = 1; i <= filesToGenerate; i++)
            {
                int rows = seed.Next(10, 8000);
                for (int j = 0; j < rows; j++)
                {
                    sb.Append(GenerateRow(seed, columnsToGenerate));
                }
                using (TextWriter tw = new StreamWriter(String.Format(@"{0}/File{1:00000}.txt", dirInfo, i)))
                {
                    tw.Write(sb.ToString());
                    tw.Flush();
                }
                sb.Remove(0, sb.Length);
                Console.Clear();
                Console.Out.WriteLine("Generating file {0:00000} of {1}", i, filesToGenerate);
            }
        }

        private static string GenerateString(Random seed)
        {
            StringBuilder sb = new StringBuilder();
            int characters = seed.Next(4, 12);
            for (int i = 0; i < characters; i++)
            {
                sb.Append(Convert.ToChar(Convert.ToInt32(Math.Floor(26 * seed.NextDouble() + 65))));
            }
            return sb.ToString();
        }

        private static string GenerateRow(Random seed, int columnsToGenerate)
        {
            StringBuilder sb = new StringBuilder();

            sb.Append(seed.Next());
            for (int i = 0; i < columnsToGenerate - 1; i++)
            {
                sb.Append(",");
                sb.Append(GenerateString(seed));
            }
            sb.Append("\n");

            return sb.ToString();
        }
        #endregion
    }
}

*These other columns will be needed and accessed both sequentially and randomly throughout the life of the program, so reading them from disk each time would be a tremendous overhead.

**Environment Notes: 4GB of DDR2-800 SDRAM, Core 2 Duo 2.5GHz, .NET Runtime 3.5 SP1, Vista 64-bit.

+13  A: 

Yes, String.Split creates a new String object for each "piece" - that's what it's meant to do.

Now, bear in mind that strings in .NET are Unicode (UTF-16 really), and with the object overhead the cost of a string in bytes is approximately 20 + 2*n where n is the number of characters.

That means if you've got a lot of small strings, it'll take a lot of memory compared with the size of text data involved. For example, an 80 character line split into 10 x 8 character strings will take 80 bytes in the file, but 10 * (20 + 2*8) = 360 bytes in memory - a 4.5x blow-up!
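
If you want to sanity-check that figure on your own machine, a rough measurement along these lines should come out near 20 + 2*n bytes per string (an illustrative throwaway program, not part of any library; exact numbers vary by runtime and platform):

using System;

class StringOverheadDemo
{
    static void Main()
    {
        //Allocate the array first so only the strings themselves are measured.
        string[] pieces = new string[1000000];
        long before = GC.GetTotalMemory(true);
        for (int i = 0; i < pieces.Length; i++)
        {
            pieces[i] = new string('x', 8);
        }
        long after = GC.GetTotalMemory(true);
        //pieces is still referenced below, so the strings cannot be collected before measurement.
        Console.WriteLine("Approximate bytes per 8-character string: {0}",
                          (after - before) / pieces.Length);
    }
}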

I doubt that this is a GC problem - and I'd advise you to remove the extra statements setting variables to null, as they're not necessary - it's just a problem of having too much data.

What I would suggest is that you read the file line-by-line (using TextReader.ReadLine() instead of TextReader.ReadToEnd()). Clearly having the whole file in memory if you don't need to is wasteful.
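
A minimal sketch of that approach, written as a drop-in replacement for the SelectColumn method in the question (SelectColumnLineByLine is a made-up name, and it assumes the same usings and file layout as the posted program):

private static List<string> SelectColumnLineByLine(int fileNumber, int column, out long fileSize)
{
    FileInfo file = new FileInfo(string.Format(@"MemLeakTestFiles/File{0:00000}.txt", fileNumber));
    fileSize = file.Length;
    List<string> processedColumn = new List<string>();
    using (StreamReader sr = file.OpenText())
    {
        string line;
        while ((line = sr.ReadLine()) != null)
        {
            if (line.Length == 0) continue; //skip the trailing empty line, if any
            //Only the selected field is kept; the rest of the line becomes garbage immediately.
            processedColumn.Add(line.Split(',')[column]);
        }
    }
    return processedColumn;
}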

Jon Skeet
Extremely informative answer. As MSalters suggested, it seems that I would need to represent the data a different way if I want to work with all the information at once.
exceptionerror
Yes - although you'll still run into problems eventually. If you can work out a way of processing the data in a streaming fashion, the solution will scale a lot better.
Jon Skeet
Would you recommend something like "push" linq so that I can extract relational information across files without looping?
exceptionerror
It depends on exactly what you need to do, but yes, Push LINQ is great for aggregation over huge data sets.
Jon Skeet
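
(For reference, this is not Push LINQ itself, but plain LINQ to Objects over an iterator already gives that streaming flavour; every name below is illustrative and the file layout is the one from the question. Nothing here holds more than one line in memory at a time, apart from whatever the aggregation itself needs.)

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class StreamingAggregateSketch
{
    //Yields the selected column from every file, one value at a time.
    static IEnumerable<string> ColumnValues(int fileCount, int column)
    {
        for (int i = 1; i <= fileCount; i++)
        {
            using (StreamReader sr = File.OpenText(string.Format(@"MemLeakTestFiles/File{0:00000}.txt", i)))
            {
                string line;
                while ((line = sr.ReadLine()) != null)
                {
                    if (line.Length > 0)
                    {
                        yield return line.Split(',')[column];
                    }
                }
            }
        }
    }

    static void Main()
    {
        //Example aggregation: average length of the values in column 2 across 3000 files.
        double averageLength = ColumnValues(3000, 2).Average(value => value.Length);
        Console.WriteLine("Average length: {0}", averageLength);
    }
}
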
+3  A: 

I would suggest reading line by line instead of the entire file, or in blocks of up to 1-2MB.

I wrote a library to read web server log files a few years ago and found that reading line by line wasn't the fastest way of getting the data in. Here's one method - I was parsing line by line.

Update:
From Jon's comments I was curious and experimented with 4 methods:

  • StreamReader.ReadLine (default and custom buffer size)
  • StreamReader.ReadToEnd
  • My method listed above

Reading a 180MB log file:

  • ReadLine ms: 1937
  • ReadLine bigger buffer, ascii ms: 1926
  • ReadToEnd ms: 2151
  • Custom ms: 1415

The custom StreamReader was:

StreamReader streamReader = new StreamReader(fileStream, Encoding.Default, false, 16384)

StreamReader's buffer by default is 1024.
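
A sketch of wiring that constructor up for line-by-line reading (BufferedLineReader and the path parameter are made-up names; the 16384 buffer size is the figure quoted above):

using System.Collections.Generic;
using System.IO;
using System.Text;

class BufferedLineReader
{
    //Streams lines back one at a time using the larger 16KB read buffer.
    public static IEnumerable<string> ReadLines(string path)
    {
        using (FileStream fileStream = new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        using (StreamReader streamReader = new StreamReader(fileStream, Encoding.Default, false, 16384))
        {
            string line;
            while ((line = streamReader.ReadLine()) != null)
            {
                yield return line;
            }
        }
    }
}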

For memory consumption (the actual question!) - ~800MB used. And the method I give still uses a StringBuilder (which itself uses a string), so it consumes no less memory.

Chris S
Or just call TextReader.ReadLine()... that's what it's there for...
Jon Skeet
(I'd also strongly suggest using a "using" statement to avoid leaving the stream open in case of an exception, and renaming "bytesRead" to "charactersRead".)
Jon Skeet
I'll edit my answer as I contradict myself + update that 3 year old code with your suggestions. The 16384 buffer size was the main difference which was from a discussion on microsoft.public.dotnet.languages.csharp about C++ vs C# performance for text size.
Chris S
TextReader and StreamReader do it byte by byte IIRC, which was a lot slower when I did some tests reading 1.5MB log files - I was parsing a line at a time too
Chris S
I hope in your tests you rebooted between runs to clear out the file cache...
Jon Skeet
TextReader.ReadLine() checks char by char for end-of-line characters - but it doesn't call Read for a single character at a time. (And both StreamReader and FileStream have buffers.) Btw, using File.OpenText applies a few optimizations to the FileStream that it creates, in particular (cont)
Jon Skeet
it optimizes for sequential access. (I've been benchmarking a mixture of IO and CPU stuff recently - see http://msmvps.com/jon.skeet and the last few articles)
Jon Skeet
This was the code Jon: http://pastebin.com/m6e8a98bd . It's rough and ready but I did get roughly the same results each time. I don't know why it's faster but my guess would be no branching on \n and \r in the loop and decoding using Decoder
Chris S
4 separate files instead of a reboot. The ReadLine method is faster for the first 2 tries, then the time for the char[] method decreases a lot
Chris S
+2  A: 

Modern GC languages take advantage of large amounts of cheap RAM to offload memory-management tasks. This imposes a certain overhead, but your typical business app doesn't really need that much information anyway. Many programs get by with fewer than a thousand objects. Manually managing that many is a chore, but even a thousand bytes of per-object overhead wouldn't matter.

In your case, the per-object overhead is becoming a problem. You could, for instance, represent each column as one object, implemented as a single String plus an array of integer offsets. To return a single field, you return a substring (possibly via a shim).
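
A rough sketch of that single-string-plus-offsets idea (PackedColumn and its members are made-up names; this illustrates the layout, it is not a finished class):

using System.Collections.Generic;
using System.Text;

class PackedColumn
{
    private readonly string data;   //all field values concatenated into one string
    private readonly int[] offsets; //start index of each field; the extra last entry marks the end

    public PackedColumn(IList<string> values)
    {
        offsets = new int[values.Count + 1];
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.Count; i++)
        {
            offsets[i] = sb.Length;
            sb.Append(values[i]);
        }
        offsets[values.Count] = sb.Length;
        data = sb.ToString();
    }

    public int Count { get { return offsets.Length - 1; } }

    //Materializes a field on demand as a substring; callers shouldn't hold on to
    //the result if they want to keep memory flat.
    public string this[int index]
    {
        get { return data.Substring(offsets[index], offsets[index + 1] - offsets[index]); }
    }
}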

MSalters
It seems that I've exhausted available C# best practices, and your answer pointed me to the next best thing. I really like C#, but I'm wondering if it would be a good idea to learn and work with C++/CLI in the future, if I encounter other data-intensive challenges like this.
exceptionerror
Do consider native C++; it can be quite efficient in these cases. Yes, you'll have to write a lot of code for functionality that's included in C#. But that's exactly the point; you are one of the few who can't afford the .Net defaults.
MSalters
I had a very similar problem a few years back when I experimented with a throwaway .net DB conversion utility. I couldn't make the .net one work fast, but a very simple c++ OLEDB app worked really quickly. I figured the .net lib I was using was very inefficient wrt memory.
gbjbaanb
the difference was 10 hours down to 10 minutes (roughly). I appreciate I may have cocked it up, but I tried everything I could think of to make it work faster. Sometimes you just need a better tool for some jobs.
gbjbaanb